From owner-freebsd-arch@FreeBSD.ORG  Sun May 27 16:22:12 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 0AA2F1065672
	for <freebsd-arch@freebsd.org>; Sun, 27 May 2012 16:22:12 +0000 (UTC)
	(envelope-from rmh.aybabtu@gmail.com)
Received: from mail-ob0-f182.google.com (mail-ob0-f182.google.com
	[209.85.214.182])
	by mx1.freebsd.org (Postfix) with ESMTP id C361C8FC19
	for <freebsd-arch@freebsd.org>; Sun, 27 May 2012 16:22:11 +0000 (UTC)
Received: by obcni5 with SMTP id ni5so5488833obc.13
	for <freebsd-arch@freebsd.org>; Sun, 27 May 2012 09:22:11 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:cc:content-type;
	bh=2T4M2yrOSJB4ztn5XRBiqwoJgEhsYzJ4rbeK6W7Rbl4=;
	b=AXJb4ux4hA7pq+pbGXPssQ7zwQ3fne9MiN4Dis3JYxgF6KUw9ZpmYF3ajeFshEf62x
	JshPco9ZvprtQch60RaQE/ED27skEMhGGkzCl0E6JSuryhhyLNvEmrP8OkY3md9is1wz
	/L/qCWEtDZGzezYXBNcdE0TtRieeVTZWK+bbPyfR8qqs/elzBe2W3FCqDgBO8DMtDlg1
	Q33DFs4qz2UGT7RpC/zmqkZUrKMxzG6kbpeYD8cOK69xMUiU4aVoTTM7m7S4ZsNV6phk
	5qLuTcxWVwbBdS7MteVU8j4wP1jzt39JwRBDfGlZY9N3OkXydnIq6HhKQQKPhx+I8NIm
	DT4g==
MIME-Version: 1.0
Received: by 10.50.149.129 with SMTP id ua1mr2776036igb.43.1338135731004; Sun,
	27 May 2012 09:22:11 -0700 (PDT)
Sender: rmh.aybabtu@gmail.com
Received: by 10.42.202.84 with HTTP; Sun, 27 May 2012 09:22:10 -0700 (PDT)
In-Reply-To: <20120519134005.GJ2358@deviant.kiev.zoral.com.ua>
References: <CAOfDtXPidEVGHDeZWTQyk-X6pabc0HBqWLdNJG_zRgX=7iKgWg@mail.gmail.com>
	<20120519134005.GJ2358@deviant.kiev.zoral.com.ua>
Date: Sun, 27 May 2012 18:22:10 +0200
X-Google-Sender-Auth: vlHsJ8bKVAiH4YZpZsNV5zuZeSc
Message-ID: <CAOfDtXOZe5vkoBHbZtmS4puYOM9sYR7=JVuOXS4kxnCHK6wSKA@mail.gmail.com>
From: Robert Millan <rmh@freebsd.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Content-Type: text/plain; charset=UTF-8
Cc: freebsd-arch@freebsd.org
Subject: Re: headers that use "struct bintime"
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 May 2012 16:22:12 -0000

2012/5/19 Konstantin Belousov <kostikbel@gmail.com>:
>> sys/arm/include/cpu.h
>> sys/dev/iscsi/initiator/iscsivar.h
>> sys/geom/journal/g_journal.h
>> sys/sys/dtrace_bsd.h
>> sys/sys/devicestat.h
>> sys/sys/timeet.h
>> sys/sys/bio.h
>> sys/opencrypto/cryptodev.h
>>
> Note that all headers you listed are kernel headers, and kernel is exposed
> to the whole namespace. I suspect that no headers are supposed to be used
> by usermode among the list.

There's at least one case (sys/devicestat.h) which is widely exposed
to userland:

lib/libdevstat/devstat.h:#include <sys/devicestat.h>
lib/libgeom/geom_stats.c:#include <sys/devicestat.h>
usr.bin/kdump/ioctl.c:#include <sys/devicestat.h>
sbin/mdconfig/mdconfig.c:#include <sys/devicestat.h>

and also into sys/cam/, some of which is in userland too (built into libcam):

sys/cam/ata/ata_pmp.c:#include <sys/devicestat.h>
sys/cam/ata/ata_da.c:#include <sys/devicestat.h>
sys/cam/scsi/scsi_pt.c:#include <sys/devicestat.h>
sys/cam/scsi/scsi_pass.c:#include <sys/devicestat.h>
sys/cam/scsi/scsi_targ_bh.c:#include <sys/devicestat.h>
sys/cam/scsi/scsi_sa.c:#include <sys/devicestat.h>
sys/cam/scsi/scsi_da.c:#include <sys/devicestat.h>
sys/cam/scsi/scsi_target.c:#include <sys/devicestat.h>
sys/cam/scsi/scsi_sg.c:#include <sys/devicestat.h>
sys/cam/scsi/scsi_cd.c:#include <sys/devicestat.h>
sys/cam/cam_periph.c:#include <sys/devicestat.h>

-- 
Robert Millan

From owner-freebsd-arch@FreeBSD.ORG  Sun May 27 18:01:08 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 7B2291065677;
	Sun, 27 May 2012 18:01:08 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail07.syd.optusnet.com.au (mail07.syd.optusnet.com.au
	[211.29.132.188])
	by mx1.freebsd.org (Postfix) with ESMTP id EB0708FC28;
	Sun, 27 May 2012 18:01:07 +0000 (UTC)
Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au
	(c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232])
	by mail07.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q4RI0x4E019202
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Mon, 28 May 2012 04:01:00 +1000
Date: Mon, 28 May 2012 04:00:59 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Robert Millan <rmh@FreeBSD.org>
In-Reply-To: <CAOfDtXOZe5vkoBHbZtmS4puYOM9sYR7=JVuOXS4kxnCHK6wSKA@mail.gmail.com>
Message-ID: <20120528023818.F2417@besplex.bde.org>
References: <CAOfDtXPidEVGHDeZWTQyk-X6pabc0HBqWLdNJG_zRgX=7iKgWg@mail.gmail.com>
	<20120519134005.GJ2358@deviant.kiev.zoral.com.ua>
	<CAOfDtXOZe5vkoBHbZtmS4puYOM9sYR7=JVuOXS4kxnCHK6wSKA@mail.gmail.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Konstantin Belousov <kostikbel@gmail.com>, freebsd-arch@FreeBSD.org
Subject: Re: headers that use "struct bintime"
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 May 2012 18:01:08 -0000

On Sun, 27 May 2012, Robert Millan wrote:

> 2012/5/19 Konstantin Belousov <kostikbel@gmail.com>:
>>> sys/arm/include/cpu.h
>>> sys/dev/iscsi/initiator/iscsivar.h
>>> sys/geom/journal/g_journal.h
>>> sys/sys/dtrace_bsd.h
>>> sys/sys/devicestat.h
>>> sys/sys/timeet.h
>>> sys/sys/bio.h
>>> sys/opencrypto/cryptodev.h
>>>
>> Note that all headers you listed are kernel headers, and kernel is exposed
>> to the whole namespace. I suspect that no headers are supposed to be used
>> by usermode among the list.
>
> There's at least one case (sys/devicestat.h) which is widely exposed
> to userland:

devstat has silly APIs, with long double values (despite needing the
range and precision of long doubles over doubles less than most things)
and bintimes (despite needing the precision of bintimes over less than
most things), but is well established so is hard to fix now.

> lib/libdevstat/devstat.h:#include <sys/devicestat.h>
> lib/libgeom/geom_stats.c:#include <sys/devicestat.h>
> usr.bin/kdump/ioctl.c:#include <sys/devicestat.h>

devicestat.h doesn't even have any ioctls in it, but it is apparently
needed here to supply pollution in other headers.  Perhaps it is no
longer even needed here.  In FreeBSD-4, it was needed for the include
of <sys/ccdvar.h>, which does have ioctls in it and also has a
complete struct devstat in its softc.  Headers with softcs in them
never belonged in <sys>, and softcs never belonged in public APIs,
and at least the <sys> part of this has been fixed -- <sys/ccdvar.h>
no longer exists, and the only struct devstats in <sys> are an
incomplete one in bio.h and of course the full one in devicestat.h.
ccd was from NetBSD, and has gone away completely.  FreeBSD used
mainly vn at first, then md.

I forget if you need bintimes to work when !__BSD_VISIBLE, or just
need the headers to be self-sufficient.  devicestat.h already has
the latter.  It includes sys/time.h.  The early include of sys/time.h
in kdump/ioctl.c has no effect, since it is after the include of
sys/devicestat.h where sys/time.h has already been included nested.
OTOH, in the kernel the include of sys/time.h in sys/devicestat.h
normally has no effect, since normally sys/param.h is included earlier
and it supplies sys/time.h as standard pollution.

> sbin/mdconfig/mdconfig.c:#include <sys/devicestat.h>

/dev/md's softc is ugly and includes a struct devstat, but its ugliness
is not public (it is only in dev/md/md.c).  Otherwise it would give
the same problem as ccd used to.

> and also into sys/cam/, some of which is in userland too (built into libcam):
>
> sys/cam/ata/ata_pmp.c:#include <sys/devicestat.h>
> sys/cam/ata/ata_da.c:#include <sys/devicestat.h>
> sys/cam/scsi/scsi_pt.c:#include <sys/devicestat.h>
> sys/cam/scsi/scsi_pass.c:#include <sys/devicestat.h>
> sys/cam/scsi/scsi_targ_bh.c:#include <sys/devicestat.h>
> sys/cam/scsi/scsi_sa.c:#include <sys/devicestat.h>
> sys/cam/scsi/scsi_da.c:#include <sys/devicestat.h>
> sys/cam/scsi/scsi_target.c:#include <sys/devicestat.h>
> sys/cam/scsi/scsi_sg.c:#include <sys/devicestat.h>
> sys/cam/scsi/scsi_cd.c:#include <sys/devicestat.h>

I wonder why scsi is still doing so much with devstat.  For disks,
devstat handling mostly moved into geom.  In the old ata driver,
there are no remaining reference to struct devstat or devicestat.h
in any disk driver.  This works for ad, but IIRC device statistics
were broken for a long time for acd.  Non-disk drivers that don't
go through geom must still do their own devstat handling.  Scsi
drivers always did, and the above shows them still doing it, but
ata_da, da and cd are disk drivers so why do they do it?  Old ata
drivers mostly didn't, so device statistics never worked for them.
Now, only the ata tape driver does its own device statistics, as
it always did.

All of the above are .c files and all of the struct devstats in their
softc's are private, so they don't affect userland.

The struct devstats in the above are:

scsi/scsi_ch.c:	struct devstat	*device_stats;
scsi/scsi_pass.c:	struct devstat	*device_stats;
scsi/scsi_pt.c:	struct	 devstat *device_stats;
scsi/scsi_sa.c:	struct		devstat *device_stats;
scsi/scsi_sg.c:	struct devstat		*device_stats;
scsi/scsi_targ_bh.c:	struct		devstat device_stats;
scsi/scsi_target.c:	struct devstat		 device_stats;

The cam/ata files don't use any devstat functions, so the includes of
devicestat.h in them are apparently bogus.

cd and sa do use devstat functions, so the includes of devicestat.h
in them are needed, but they don't appear in the previous list because
they put the devstat struct in a general disk/geom struct instead of
in their softc.

scsi_ch.c includes devicestat.h and needs it but is not in your list.
This and the others not already described must do their own devstat
handling since they are not disks.  All except the targ* ones only
use an incomplete struct device_stats (a pointer to a full one).  For
disk drivers, this indirection helps avoid polluting public disk
headers with the complete declaration.  It might not be useful here,
but all the non-targ* scsi drivers do it.  In FreeBSD-4 (before geom),
_all_ scsi drivers used a complete device_stats struct in their softc.

Oops, I forgot the libcam use of kernel files.  It seems to use only
scsi_da.c and scsi_sa.c from the above.  Both of these use only an
indirect struct device_stats in their softc.  They can't reasonably
access this in userland, and it is clear that scsi_da.c tries not to
since only includes devicestat.h under a _KERNEL ifdef.  scsi_sa.c
is not so careful.  I checked this in the userland .depend.
devicestat.h is only depended on by scsi_sa.c.  Hopefully it can be
fixed in the same way as scsi_da.c was (I guess the latter does less
with devstat because more is done in geom).

Thus there seems to be no real need for devicestat.h from scsi files
userland.

The public interface for libdevstat is <devstat.h>.  This shouldn't
export struct device_stat, but it includes <sys/devicestat.h> and
struct device_stat isn't in the _KERNEL section there.  Even the
libdevstat implementation never references struct device_stat, so
it seems to be pure pollution in libdevstat.  I think libdevstat
just uses a sysctl that converts kernel struct devicestats into
userland struct devstat, so userland should never see the former.
However, bintimes are part of the public API (they are in struct
devstat and a couple of functions).

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Mon May 28 06:36:12 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 83699106564A;
	Mon, 28 May 2012 06:36:12 +0000 (UTC)
	(envelope-from phk@phk.freebsd.dk)
Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222])
	by mx1.freebsd.org (Postfix) with ESMTP id 3E6DB8FC0A;
	Mon, 28 May 2012 06:36:12 +0000 (UTC)
Received: from critter.freebsd.dk (critter-phk.freebsd.dk [192.168.48.2])
	by phk.freebsd.dk (Postfix) with ESMTP id E37A0139C3;
	Mon, 28 May 2012 06:36:04 +0000 (UTC)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.14.5/8.14.5) with ESMTP id q4S6a2i8022721;
	Mon, 28 May 2012 06:36:03 GMT (envelope-from phk@phk.freebsd.dk)
To: Bruce Evans <brde@optusnet.com.au>
From: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
In-Reply-To: Your message of "Mon, 28 May 2012 04:00:59 +1000."
	<20120528023818.F2417@besplex.bde.org>
Content-Type: text/plain; charset=ISO-8859-1
Date: Mon, 28 May 2012 06:36:02 +0000
Message-ID: <22720.1338186962@critter.freebsd.dk>
Cc: Konstantin Belousov <kostikbel@gmail.com>, freebsd-arch@FreeBSD.org,
	Robert Millan <rmh@FreeBSD.org>
Subject: Re: headers that use "struct bintime"
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 May 2012 06:36:12 -0000

In message <20120528023818.F2417@besplex.bde.org>, Bruce Evans writes:

>devstat has silly APIs, with long double values (despite needing the
>range and precision of long doubles over doubles less than most things)
>and bintimes (despite needing the precision of bintimes over less than
>most things), but is well established so is hard to fix now.

I take it you meant to write:  devstat wisely chose sufficiently
powerful data types to cover a significant of stretch of future,
rather than fall in the all too common trap of skimping on data
types for the sake of a few bytes ?

>Headers with softcs in them
>never belonged in <sys>, and softcs never belonged in public APIs,

Well, that's pretty much how ioctls worked on very early versions
of Unix, but I agree with you that it was architecturally wrong.

>I wonder why scsi is still doing so much with devstat.  For disks,
>devstat handling mostly moved into geom.

Mainly because SCSI also does tapes, and non-GEOM operations, such
as formatting, on disks.

I should have severed that link when I did GEOM, splitting
devicestat into geomstat and camstat.

Doing so now would still be a good idea.

>I think libdevstat
>just uses a sysctl that converts kernel struct devicestats into
>userland struct devstat, so userland should never see the former.

It mmaps /dev/devstat, so it is slightly more tangled.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

From owner-freebsd-arch@FreeBSD.ORG  Mon May 28 17:11:27 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id B6B29106566C;
	Mon, 28 May 2012 17:11:27 +0000 (UTC)
	(envelope-from freebsd-listen@fabiankeil.de)
Received: from smtprelay06.ispgateway.de (smtprelay06.ispgateway.de
	[80.67.31.101])
	by mx1.freebsd.org (Postfix) with ESMTP id 744698FC0A;
	Mon, 28 May 2012 17:11:27 +0000 (UTC)
Received: from [78.35.185.129] (helo=fabiankeil.de)
	by smtprelay06.ispgateway.de with esmtpsa (TLSv1:AES128-SHA:128)
	(Exim 4.68) (envelope-from <freebsd-listen@fabiankeil.de>)
	id 1SZ3OF-0001HQ-Mj; Mon, 28 May 2012 19:05:55 +0200
Date: Mon, 28 May 2012 19:03:00 +0200
From: Fabian Keil <freebsd-listen@fabiankeil.de>
To: gnn@freebsd.org
Message-ID: <20120528190300.3a43fc8d@fabiankeil.de>
In-Reply-To: <86wr40tfhf.wl%gnn@neville-neil.com>
References: <86wr40tfhf.wl%gnn@neville-neil.com>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=PGP-SHA1;
	boundary="Sig_/JwiCtaM=HFd/B+UCSuQaqjH";
	protocol="application/pgp-signature"
X-Df-Sender: Nzc1MDY3
Cc: arch@freebsd.org
Subject: Re: RFC: A trial io provider for DTrace...
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 May 2012 17:11:27 -0000

--Sig_/JwiCtaM=HFd/B+UCSuQaqjH
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable

gnn@freebsd.org wrote:

> I have just put up the first patch that can give you something similar
> to the io provider in DTrace.  The patch is against HEAD of about a
> week ago.
>=20
> You can find the patch here: freebsd.org:
>=20
> http://people.freebsd.org/~gnn/dtio_provider.diff
>=20
> Note that you need to create a src/sys/modules/dtrace/dtio/ directory
> for this patch, since patch doesn't seem to create directories for me.

Worked for me when applying with -p0.

> The arguments are not exactly the same as in Solaris, for instance I
> don't yet support the fileinfo_t, but, you can get to the devstat and
> bio structures via args[0] and args[1] respectively.
>=20
> Here is an example of it working:
>=20
> dtrace -n 'io:::start /args[0] !=3D 0/{ trace(args[0]->bio_bcount)}'
>=20
> Remember you need to be root to use DTrace.

Do you intent to eventually commit your patch to get dtrace working
with sudo? I've been using it since you posted it last October and
haven't seen any issues.
http://lists.freebsd.org/pipermail/freebsd-current/2011-October/028120.html

> I need to clean this up and get the translators working properly
> before I can check this in.
>=20
> Also, note that this patch doesn't catch all I/O, but should get most
> of it, as it's hooked into the devstat system.
>=20
> I will be adding manual pages for the internals of DTrace to our
> section 9, as well as, hopefully, writing up a wiki page on how to add
> your own kernel providers.
>=20
> Comments welcome.

I got:

clang -c -O2 -pipe -fno-strict-aliasing  -std=3Dc99 -g -Wall -Wredundant-de=
cls -Wnested-externs -Wstrict-prototypes  -Wmissing-prototypes -Wpointer-ar=
ith -Winline -Wcast-qual
    /usr/src/sys/kern/subr_devstat.c:390:2: error: use of undeclared identi=
fier 'bs'
            DTRACE_DEVSTAT_BIO_DONE();
            ^
    /usr/src/sys/kern/subr_devstat.c:76:41: note: expanded from macro 'DTRA=
CE_DEVSTAT_BIO_DONE'
                    (*dtrace_io_done_probe)(dtio_done_id, bs, ds);
                                                          ^
    1 error generated.
    *** Error code 1

    Stop in /usr/obj/usr/src/sys/ZOEY.
    *** Error code 1

    Stop in /usr/src.
    *** [buildkernel] Error code 1

and used the following patch to get it to compile:

diff --git a/sys/kern/subr_devstat.c b/sys/kern/subr_devstat.c
index e2b6d21..732bf9c 100644
--- a/sys/kern/subr_devstat.c
+++ b/sys/kern/subr_devstat.c
@@ -73,7 +73,7 @@ uint32_t      dtio_wait_done_id;

 #define DTRACE_DEVSTAT_BIO_DONE() \
        if (dtrace_io_done_probe !=3D NULL) \
-               (*dtrace_io_done_probe)(dtio_done_id, bs, ds);
+               (*dtrace_io_done_probe)(dtio_done_id, bp, ds);

Other than that the provider seems to work fine so far.

Thanks a lot.

Fabian

--Sig_/JwiCtaM=HFd/B+UCSuQaqjH
Content-Type: application/pgp-signature; name=signature.asc
Content-Disposition: attachment; filename=signature.asc

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iEYEARECAAYFAk/Dr8sACgkQBYqIVf93VJ0JFgCfTDujAxJajr4079QCmroHgWsl
wDcAn12ctyO6y8hGFhLA1RwxzoB4TNNj
=Jfaw
-----END PGP SIGNATURE-----

--Sig_/JwiCtaM=HFd/B+UCSuQaqjH--

From owner-freebsd-arch@FreeBSD.ORG  Mon May 28 19:56:02 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4C4111065674;
	Mon, 28 May 2012 19:56:02 +0000 (UTC) (envelope-from dim@FreeBSD.org)
Received: from tensor.andric.com (cl-327.ede-01.nl.sixxs.net
	[IPv6:2001:7b8:2ff:146::2])
	by mx1.freebsd.org (Postfix) with ESMTP id 095118FC0C;
	Mon, 28 May 2012 19:55:58 +0000 (UTC)
Received: from [IPv6:2001:7b8:3a7:0:4c1c:92fb:538c:83ed] (unknown
	[IPv6:2001:7b8:3a7:0:4c1c:92fb:538c:83ed])
	(using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
	(No client certificate requested)
	by tensor.andric.com (Postfix) with ESMTPSA id 2E7C65C59;
	Mon, 28 May 2012 21:55:58 +0200 (CEST)
Message-ID: <4FC3D84B.2060302@FreeBSD.org>
Date: Mon, 28 May 2012 21:55:55 +0200
From: Dimitry Andric <dim@FreeBSD.org>
Organization: The FreeBSD Project
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
	rv:13.0) Gecko/20120522 Thunderbird/13.0
MIME-Version: 1.0
To: Baptiste Daroussin <bapt@FreeBSD.org>
References: <20120526235510.GB90668@ithaqua.etoilebsd.net>
In-Reply-To: <20120526235510.GB90668@ithaqua.etoilebsd.net>
X-Enigmail-Version: 1.5a1pre
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: arch@FreeBSD.org
Subject: Re: switch tounconditionnal boostrapping while to build the tree
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 May 2012 19:56:02 -0000

On 2012-05-27 01:55, Baptiste Daroussin wrote:
> After I replace yacc(1) by byacc(1) on current, we discovered than now it is
> impossible to build 9 on current, because byacc(1) is not 100% backward
> compatible with our yacc(1). this is because building a boostrap yacc(1) is
> conditionned on the version of the host that is building world.
> 
> Looking at Makefile.inc1 I can see that lots of tools are conditionned like
> this. I think if we want to go to be able to cross build the tree (I remember
> from EuroBSDcon that this is something we want to do) then we need to remove the
> conditions and always boostrap any tool necessary to be able to build the tree.
> 
> so if no one care I'll remove the condition to boostrap at least yacc(1) and
> lex(1) on current, 9, 8 and 7.
> 
> Would be great imho to do the same for any tools needed by the build system.

It could prevent a lot of subtle (and not to subtle :) problems, but it
will also waste a lot of CPU time and energy building stuff that isn't
strictly needed.  (I'm saying this with tongue in cheek, since I'm
responsible for a lot of CPU wastage, a.k.a. clang... ;)

E.g., the bootstrapping version check mechanism which is now in place,
is really a build time optimization, comparable to running builds with
NO_CLEAN: you can shoot yourself in the foot, it's dirty, but it works
most of the time, and it is *much* faster.

I really would not want to throw all that away.  But as a compromise,
you could add an option to do "brute force bootstrapping", which ignores
all version checking, and just builds all required bootstrap tools.

The question is also what your end goal is: do you want to reach a
NetBSD style approach (basically bootstrap *everything*), or just make
the current implementation more robust?

From owner-freebsd-arch@FreeBSD.ORG  Mon May 28 20:07:09 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 88825106566B;
	Mon, 28 May 2012 20:07:09 +0000 (UTC) (envelope-from des@des.no)
Received: from smtp.des.no (smtp.des.no [194.63.250.102])
	by mx1.freebsd.org (Postfix) with ESMTP id 4A6D78FC14;
	Mon, 28 May 2012 20:07:09 +0000 (UTC)
Received: from ds4.des.no (smtp.des.no [194.63.250.102])
	by smtp.des.no (Postfix) with ESMTP id 445016171;
	Mon, 28 May 2012 20:07:08 +0000 (UTC)
Received: by ds4.des.no (Postfix, from userid 1001)
	id 08F7B8DB0; Mon, 28 May 2012 22:07:07 +0200 (CEST)
From: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= <des@des.no>
To: Baptiste Daroussin <bapt@FreeBSD.org>
References: <20120526235510.GB90668@ithaqua.etoilebsd.net>
Date: Mon, 28 May 2012 22:07:07 +0200
In-Reply-To: <20120526235510.GB90668@ithaqua.etoilebsd.net> (Baptiste
	Daroussin's message of "Sun, 27 May 2012 01:55:10 +0200")
Message-ID: <86aa0s83ys.fsf@ds4.des.no>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (berkeley-unix)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Cc: arch@FreeBSD.org
Subject: Re: switch tounconditionnal boostrapping while to build the tree
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 May 2012 20:07:09 -0000

Baptiste Daroussin <bapt@FreeBSD.org> writes:
> so if no one care I'll remove the condition to boostrap at least
> yacc(1) and lex(1) on current, 9, 8 and 7.

I should interject that I've already added code to Makefile.inc1 in 7, 8
and 9 so yacc is a bootstrap tool when building on a system that has
byacc, so right now Baptiste's proposed change won't make much
difference, but we should definitely give some thought to what we
consider a bootstrap tool and when we should build them.  Blindly
removing all conditionals is *not* an option, though, as some of the
bootstrap tools take ages to build.  Remember that all bootstrap tools
and build tools are built twice, once during the bootstrap / toolchain
phase and once during the everything phase.

DES
--=20
Dag-Erling Sm=C3=B8rgrav - des@des.no

From owner-freebsd-arch@FreeBSD.ORG  Wed May 30 12:12:03 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 73805106566B;
	Wed, 30 May 2012 12:12:03 +0000 (UTC)
	(envelope-from johnandsara2@cox.net)
Received: from eastrmfepo203.cox.net (eastrmfepo203.cox.net [68.230.241.218])
	by mx1.freebsd.org (Postfix) with ESMTP id EDD938FC0C;
	Wed, 30 May 2012 12:12:02 +0000 (UTC)
Received: from eastrmimpo110.cox.net ([68.230.241.223])
	by eastrmfepo203.cox.net
	(InterMail vM.8.01.04.00 201-2260-137-20101110) with ESMTP id
	<20120530121202.GPGW18532.eastrmfepo203.cox.net@eastrmimpo110.cox.net>;
	Wed, 30 May 2012 08:12:02 -0400
Received: from [192.168.3.22] ([70.177.172.35])
	by eastrmimpo110.cox.net with bizsmtp
	id GCC11j00Q0mAvba02CC288; Wed, 30 May 2012 08:12:02 -0400
X-CT-Class: Clean
X-CT-Score: 0.00
X-CT-RefID: str=0001.0A020208.4FC60E92.008B,ss=1,re=0.000,fgs=0
X-CT-Spam: 0
X-Authority-Analysis: v=1.1 cv=s1i2RV+unmn3sLkEA3lf1Tj2LikDbZyRf9iEFo2x6J8=
	c=1 sm=1 a=f5xKl4ys9bwA:10 a=_shUJCvoDt8A:10 a=G8Uczd0VNMoA:10
	a=Wajolswj7cQA:10 a=8nJEP1OIZ-IA:10 a=alU6Bxxa4qBWIf+k8j/ISQ==:17
	a=HzI0Pm0Nd4Mf1Gaf9u8A:9 a=wPNLvfGTeEIA:10
	a=alU6Bxxa4qBWIf+k8j/ISQ==:117
X-CM-Score: 0.00
Authentication-Results: cox.net; none
Message-ID: <4FC60E8C.1070204@cox.net>
Date: Wed, 30 May 2012 08:11:56 -0400
From: "John D. Hendrickson and Sara Darnell" <johnandsara2@cox.net>
User-Agent: Thunderbird 2.0.0.24 (X11/20100228)
MIME-Version: 1.0
To: Baptiste Daroussin <bapt@FreeBSD.org>
References: <20120526235510.GB90668@ithaqua.etoilebsd.net>
In-Reply-To: <20120526235510.GB90668@ithaqua.etoilebsd.net>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: arch@FreeBSD.org
Subject: Re: switch tounconditionnal boostrapping while to build the tree
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: johnandsara2@cox.net
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 May 2012 12:12:03 -0000

i find the statements hard to believe

why are you doing it that way ?  (using a broken yacc)

why do you believe it's ok to change previous releases that don't use that yacc ?

Baptiste Daroussin wrote:
> Hi
> 
> After I replace yacc(1) by byacc(1) on current, we discovered than now it is
> impossible to build 9 on current, because byacc(1) is not 100% backward
> compatible with our yacc(1). this is because building a boostrap yacc(1) is
> conditionned on the version of the host that is building world.
> 
> Looking at Makefile.inc1 I can see that lots of tools are conditionned like
> this. I think if we want to go to be able to cross build the tree (I remember
> from EuroBSDcon that this is something we want to do) then we need to remove the
> conditions and always boostrap any tool necessary to be able to build the tree.
> 
> so if no one care I'll remove the condition to boostrap at least yacc(1) and
> lex(1) on current, 9, 8 and 7.
> 
> Would be great imho to do the same for any tools needed by the build system.
> 
> regards,
> Bapt


From owner-freebsd-arch@FreeBSD.ORG  Wed May 30 13:02:45 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 27043106566C
	for <arch@FreeBSD.org>; Wed, 30 May 2012 13:02:45 +0000 (UTC)
	(envelope-from bapt@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
	[IPv6:2001:4f8:fff6::28])
	by mx1.freebsd.org (Postfix) with ESMTP id 0B17F8FC25;
	Wed, 30 May 2012 13:02:45 +0000 (UTC)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
	by freefall.freebsd.org (8.14.5/8.14.5) with ESMTP id q4UD2id6041053;
	Wed, 30 May 2012 13:02:44 GMT (envelope-from bapt@FreeBSD.org)
Received: (from bapt@localhost)
	by freefall.freebsd.org (8.14.5/8.14.5/Submit) id q4UD2iQc041037;
	Wed, 30 May 2012 13:02:44 GMT (envelope-from bapt@FreeBSD.org)
X-Authentication-Warning: freefall.freebsd.org: bapt set sender to
	bapt@FreeBSD.org using -f
Date: Wed, 30 May 2012 15:02:41 +0200
From: Baptiste Daroussin <bapt@FreeBSD.org>
To: "John D. Hendrickson and Sara Darnell" <johnandsara2@cox.net>
Message-ID: <20120530130241.GH9952@ithaqua.etoilebsd.net>
References: <20120526235510.GB90668@ithaqua.etoilebsd.net>
	<4FC60E8C.1070204@cox.net>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="+Z7/5fzWRHDJ0o7Q"
Content-Disposition: inline
In-Reply-To: <4FC60E8C.1070204@cox.net>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: arch@FreeBSD.org
Subject: Re: switch tounconditionnal boostrapping while to build the tree
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 May 2012 13:02:45 -0000


--+Z7/5fzWRHDJ0o7Q
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Wed, May 30, 2012 at 08:11:56AM -0400, John D. Hendrickson and Sara Darn=
ell wrote:
> i find the statements hard to believe
>=20
> why are you doing it that way ?  (using a broken yacc)
It is not a broken yacc. the yacc import just revealed another problem whic=
h is
boostrap tools may needs to be always boostraped (which makes sense if you
really want to support cross-compilation from nearly anywhere.

> why do you believe it's ok to change previous releases that don't use tha=
t yacc ?

To build able to build them on a system that do not have the yacc version t=
hey
need, that system could be linux for example or it could be a recent head

regards,
Bapt

--+Z7/5fzWRHDJ0o7Q
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iEYEARECAAYFAk/GGnEACgkQ8kTtMUmk6EwUGQCgldam/3Nt535vMIi9DcHoGy9S
q58An0ztuebkuqaVYDn0bTZ19KoPXiFu
=ZpeD
-----END PGP SIGNATURE-----

--+Z7/5fzWRHDJ0o7Q--

From owner-freebsd-arch@FreeBSD.ORG  Wed May 30 22:01:31 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C98D7106564A;
	Wed, 30 May 2012 22:01:31 +0000 (UTC) (envelope-from imp@bsdimp.com)
Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85])
	by mx1.freebsd.org (Postfix) with ESMTP id 7DDC18FC12;
	Wed, 30 May 2012 22:01:31 +0000 (UTC)
Received: from [10.30.101.53] ([209.117.142.2]) (authenticated bits=0)
	by harmony.bsdimp.com (8.14.4/8.14.3) with ESMTP id q4ULmXRa048852
	(version=TLSv1/SSLv3 cipher=DHE-DSS-AES128-SHA bits=128 verify=NO);
	Wed, 30 May 2012 15:48:35 -0600 (MDT) (envelope-from imp@bsdimp.com)
Mime-Version: 1.0 (Apple Message framework v1084)
Content-Type: text/plain; charset=us-ascii
From: Warner Losh <imp@bsdimp.com>
In-Reply-To: <20120530130241.GH9952@ithaqua.etoilebsd.net>
Date: Wed, 30 May 2012 15:48:28 -0600
Content-Transfer-Encoding: quoted-printable
Message-Id: <722ECB48-6C82-4FF0-AC18-02910DBD0B66@bsdimp.com>
References: <20120526235510.GB90668@ithaqua.etoilebsd.net>
	<4FC60E8C.1070204@cox.net>
	<20120530130241.GH9952@ithaqua.etoilebsd.net>
To: Baptiste Daroussin <bapt@FreeBSD.org>
X-Mailer: Apple Mail (2.1084)
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1
	(harmony.bsdimp.com [10.0.0.6]);
	Wed, 30 May 2012 15:48:35 -0600 (MDT)
Cc: arch@FreeBSD.org,
	"John D. Hendrickson and Sara Darnell" <johnandsara2@cox.net>
Subject: Re: switch tounconditionnal boostrapping while to build the tree
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 May 2012 22:01:31 -0000


On May 30, 2012, at 7:02 AM, Baptiste Daroussin wrote:

> On Wed, May 30, 2012 at 08:11:56AM -0400, John D. Hendrickson and Sara =
Darnell wrote:
>> i find the statements hard to believe
>>=20
>> why are you doing it that way ?  (using a broken yacc)
> It is not a broken yacc. the yacc import just revealed another problem =
which is
> boostrap tools may needs to be always boostraped (which makes sense if =
you
> really want to support cross-compilation from nearly anywhere.

Cross build support doesn't require that you break things like that.  =
Never had, and never will.  The FreeBSD version is irrelevant to cross =
building, so bootstrapping checks are still needed.  In the cross build, =
the bootstrapping OS version will be 0 and we'll build everything we =
need (possibly more than we would bootstrapping from supported FreeBSD =
versions).

>> why do you believe it's ok to change previous releases that don't use =
that yacc ?
>=20
> To build able to build them on a system that do not have the yacc =
version they
> need, that system could be linux for example or it could be a recent =
head.

You can accomplish this without blowing away the conditionals.

Warner


From owner-freebsd-arch@FreeBSD.ORG  Thu May 31 05:29:19 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 1419D106566B
	for <arch@FreeBSD.org>; Thu, 31 May 2012 05:29:19 +0000 (UTC)
	(envelope-from imp@bsdimp.com)
Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85])
	by mx1.freebsd.org (Postfix) with ESMTP id AEE158FC16
	for <arch@FreeBSD.org>; Thu, 31 May 2012 05:29:18 +0000 (UTC)
Received: from 63.imp.bsdimp.com (63.imp.bsdimp.com [10.0.0.63])
	(authenticated bits=0)
	by harmony.bsdimp.com (8.14.4/8.14.3) with ESMTP id q4V5RImQ051940
	(version=TLSv1/SSLv3 cipher=DHE-DSS-AES128-SHA bits=128 verify=NO)
	for <arch@freebsd.org>; Wed, 30 May 2012 23:27:18 -0600 (MDT)
	(envelope-from imp@bsdimp.com)
From: Warner Losh <imp@bsdimp.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
Date: Wed, 30 May 2012 23:27:18 -0600
Message-Id: <FC93BFD4-DB63-4C23-9B71-7840578B34BB@bsdimp.com>
To: arch@FreeBSD.org
Mime-Version: 1.0 (Apple Message framework v1084)
X-Mailer: Apple Mail (2.1084)
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1
	(harmony.bsdimp.com [10.0.0.6]);
	Wed, 30 May 2012 23:27:18 -0600 (MDT)
Cc: 
Subject: rman_await_resource
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 31 May 2012 05:29:19 -0000

Is anybody using rman_await_resource?  I can see no in-tree users, and =
the code's locking looks dubious.  I'd like to just delete it.

Warner


From owner-freebsd-arch@FreeBSD.ORG  Thu May 31 12:24:25 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 9B2A31065680;
	Thu, 31 May 2012 12:24:25 +0000 (UTC) (envelope-from avg@FreeBSD.org)
Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140])
	by mx1.freebsd.org (Postfix) with ESMTP id 6AD5C8FC0A;
	Thu, 31 May 2012 12:24:24 +0000 (UTC)
Received: from odyssey.starpoint.kiev.ua (alpha-e.starpoint.kiev.ua
	[212.40.38.101])
	by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id PAA01061;
	Thu, 31 May 2012 15:23:57 +0300 (EEST)
	(envelope-from avg@FreeBSD.org)
Message-ID: <4FC762DD.90101@FreeBSD.org>
Date: Thu, 31 May 2012 15:23:57 +0300
From: Andriy Gapon <avg@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
	rv:12.0) Gecko/20120503 Thunderbird/12.0.1
MIME-Version: 1.0
To: Christoph Hellwig <hch@infradead.org>, d@delphij.net,
	freebsd-arch@FreeBSD.org
References: <4FAC3EAB.6050303@delphij.net> <861umkurt8.fsf@ds4.des.no>
	<CAJ-VmokY+pgcq999NHShbq-3rK3=oeWT2WY7NmTvVdXOHZJhdg@mail.gmail.com>
	<CAF6rxgmDW21aPJ5Mp6Tbk1z02ivw4UPhSaNEX+Wiu7O0v13skA@mail.gmail.com>
	<20120517055425.GA802@infradead.org>
In-Reply-To: <20120517055425.GA802@infradead.org>
X-Enigmail-Version: 1.5pre
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Cc: Eitan Adler <lists@eitanadler.com>, Adrian Chadd <adrian@FreeBSD.org>,
	=?ISO-8859-1?Q?Dag-Erling_Sm=F8?=, =?ISO-8859-1?Q?rgrav?= <des@des.no>
Subject: Re: Allow small amount of memory be mlock()'ed by unprivileged
	process?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 31 May 2012 12:24:25 -0000

on 17/05/2012 08:54 Christoph Hellwig said the following:
> Linux has added a RLIMIT_MEMLOCK opcode for setrlimit that allows
> controlling the amount of memory users can lock down, with a default
> of a single page for unprivilegued processes.

In fact, FreeBSD also has this rlimit and there seems to be full support for it on
both user and kernel sides.
OTOH, PRIV_VM_MLOCK privilege seems to be granted only to the super-user in the
default configuration.  And this privilege kind of defeats the limit.

Perhaps, we should/could kill the privilege and set the limit to a sufficiently
small/safe value for ordinary users?

P.S.
Some MAC code has this comment:
/*
 * Allow VM privileges; it would be nice if these were subject to
 * resource limits.
 */
case PRIV_VM_MADV_PROTECT:
case PRIV_VM_MLOCK:

In the case of PRIV_VM_MLOCK it would be nice if one hand knew what the other is
doing :-)

P.P.S.
I would really like to see RLIMIT_NICE and RLIMIT_RTPRIO in FreeBSD.

-- 
Andriy Gapon

From owner-freebsd-arch@FreeBSD.ORG  Thu May 31 16:23:27 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4FC86106564A
	for <arch@freebsd.org>; Thu, 31 May 2012 16:23:27 +0000 (UTC)
	(envelope-from gnn@freebsd.org)
Received: from vps.hungerhost.com (vps.hungerhost.com [216.38.53.176])
	by mx1.freebsd.org (Postfix) with ESMTP id 117A78FC0C
	for <arch@freebsd.org>; Thu, 31 May 2012 16:23:27 +0000 (UTC)
Received: from [209.249.190.124] (port=55088 helo=[10.2.212.229])
	by vps.hungerhost.com with esmtpsa (TLSv1:AES128-SHA:128) (Exim 4.77)
	(envelope-from <gnn@freebsd.org>)
	id 1Sa89g-0004dH-F6; Thu, 31 May 2012 12:23:23 -0400
Mime-Version: 1.0 (Apple Message framework v1278)
Content-Type: text/plain; charset=us-ascii
From: George Neville-Neil <gnn@freebsd.org>
In-Reply-To: <20120528190300.3a43fc8d@fabiankeil.de>
Date: Thu, 31 May 2012 12:23:22 -0400
Content-Transfer-Encoding: quoted-printable
Message-Id: <F68B592D-234D-4E75-BECC-6B9295779C37@freebsd.org>
References: <86wr40tfhf.wl%gnn@neville-neil.com>
	<20120528190300.3a43fc8d@fabiankeil.de>
To: Fabian Keil <freebsd-listen@fabiankeil.de>
X-Mailer: Apple Mail (2.1278)
X-AntiAbuse: This header was added to track abuse,
	please include it with any abuse report
X-AntiAbuse: Primary Hostname - vps.hungerhost.com
X-AntiAbuse: Original Domain - freebsd.org
X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
X-AntiAbuse: Sender Address Domain - freebsd.org
Cc: arch@freebsd.org
Subject: Re: RFC: A trial io provider for DTrace...
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 31 May 2012 16:23:27 -0000


On May 28, 2012, at 13:03 , Fabian Keil wrote:

> Worked for me when applying with -p0.
>=20
Great!

>> The arguments are not exactly the same as in Solaris, for instance I
>> don't yet support the fileinfo_t, but, you can get to the devstat and
>> bio structures via args[0] and args[1] respectively.
>>=20
>> Here is an example of it working:
>>=20
>> dtrace -n 'io:::start /args[0] !=3D 0/{ trace(args[0]->bio_bcount)}'
>>=20
>> Remember you need to be root to use DTrace.
>=20
> Do you intent to eventually commit your patch to get dtrace working
> with sudo? I've been using it since you posted it last October and
> haven't seen any issues.
> =
http://lists.freebsd.org/pipermail/freebsd-current/2011-October/028120.htm=
l
>=20

Sorry, what I meant was that you needed root privilege to run DTrace,
sudo will give you that.

> I got:
>=20
> clang -c -O2 -pipe -fno-strict-aliasing  -std=3Dc99 -g -Wall =
-Wredundant-decls -Wnested-externs -Wstrict-prototypes  =
-Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual
>    /usr/src/sys/kern/subr_devstat.c:390:2: error: use of undeclared =
identifier 'bs'
>            DTRACE_DEVSTAT_BIO_DONE();
>            ^
>    /usr/src/sys/kern/subr_devstat.c:76:41: note: expanded from macro =
'DTRACE_DEVSTAT_BIO_DONE'
>                    (*dtrace_io_done_probe)(dtio_done_id, bs, ds);
>                                                          ^
>    1 error generated.
>    *** Error code 1
>=20
>    Stop in /usr/obj/usr/src/sys/ZOEY.
>    *** Error code 1
>=20
>    Stop in /usr/src.
>    *** [buildkernel] Error code 1
>=20
> and used the following patch to get it to compile:
>=20
> diff --git a/sys/kern/subr_devstat.c b/sys/kern/subr_devstat.c
> index e2b6d21..732bf9c 100644
> --- a/sys/kern/subr_devstat.c
> +++ b/sys/kern/subr_devstat.c
> @@ -73,7 +73,7 @@ uint32_t      dtio_wait_done_id;
>=20
> #define DTRACE_DEVSTAT_BIO_DONE() \
>        if (dtrace_io_done_probe !=3D NULL) \
> -               (*dtrace_io_done_probe)(dtio_done_id, bs, ds);
> +               (*dtrace_io_done_probe)(dtio_done_id, bp, ds);
>=20
> Other than that the provider seems to work fine so far.

OK, let me get that fixed up and put up a newer patch.

Thanks for testing and replying!

Best,
George


From owner-freebsd-arch@FreeBSD.ORG  Thu May 31 18:08:47 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 93901106564A
	for <arch@freebsd.org>; Thu, 31 May 2012 18:08:47 +0000 (UTC)
	(envelope-from gnn@neville-neil.com)
Received: from vps.hungerhost.com (vps.hungerhost.com [216.38.53.176])
	by mx1.freebsd.org (Postfix) with ESMTP id 61FCD8FC14
	for <arch@freebsd.org>; Thu, 31 May 2012 18:08:47 +0000 (UTC)
Received: from [209.249.190.124] (port=56608 helo=[10.2.212.229])
	by vps.hungerhost.com with esmtpsa (TLSv1:AES128-SHA:128) (Exim 4.77)
	(envelope-from <gnn@neville-neil.com>)
	id 1Sa9ni-0006Aa-8q; Thu, 31 May 2012 14:08:46 -0400
Mime-Version: 1.0 (Apple Message framework v1278)
Content-Type: text/plain; charset=us-ascii
From: George Neville-Neil <gnn@neville-neil.com>
In-Reply-To: <F68B592D-234D-4E75-BECC-6B9295779C37@freebsd.org>
Date: Thu, 31 May 2012 14:08:47 -0400
Content-Transfer-Encoding: 7bit
Message-Id: <738E93BC-3BEE-4792-9249-C2233EE8D7C6@neville-neil.com>
References: <86wr40tfhf.wl%gnn@neville-neil.com>
	<20120528190300.3a43fc8d@fabiankeil.de>
	<F68B592D-234D-4E75-BECC-6B9295779C37@freebsd.org>
To: Fabian Keil <freebsd-listen@fabiankeil.de>
X-Mailer: Apple Mail (2.1278)
X-AntiAbuse: This header was added to track abuse,
	please include it with any abuse report
X-AntiAbuse: Primary Hostname - vps.hungerhost.com
X-AntiAbuse: Original Domain - freebsd.org
X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
X-AntiAbuse: Sender Address Domain - neville-neil.com
Cc: arch@freebsd.org
Subject: Re: RFC: A trial io provider for DTrace...
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 31 May 2012 18:08:47 -0000

OK, the latest patch, with Fabien's fix, is up at:

http://people.freebsd.org/~gnn/dtio_provider_2.diff

Best,
George


From owner-freebsd-arch@FreeBSD.ORG  Fri Jun  1 01:46:10 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx2.freebsd.org (mx2.freebsd.org [69.147.83.53])
	by hub.freebsd.org (Postfix) with ESMTP id A0C471065670;
	Fri,  1 Jun 2012 01:46:10 +0000 (UTC)
	(envelope-from dougb@FreeBSD.org)
Received: from [127.0.0.1] (hub.freebsd.org [IPv6:2001:4f8:fff6::36])
	by mx2.freebsd.org (Postfix) with ESMTP id DFD4CB2CE4;
	Fri,  1 Jun 2012 01:40:43 +0000 (UTC)
Message-ID: <4FC81D9C.2080801@FreeBSD.org>
Date: Thu, 31 May 2012 18:40:44 -0700
From: Doug Barton <dougb@FreeBSD.org>
Organization: http://www.FreeBSD.org/
User-Agent: Mozilla/5.0 (Windows NT 5.1;
	rv:12.0) Gecko/20120428 Thunderbird/12.0.1
MIME-Version: 1.0
To: Andriy Gapon <avg@FreeBSD.org>
References: <4FAC3EAB.6050303@delphij.net> <861umkurt8.fsf@ds4.des.no>
	<CAJ-VmokY+pgcq999NHShbq-3rK3=oeWT2WY7NmTvVdXOHZJhdg@mail.gmail.com>
	<CAF6rxgmDW21aPJ5Mp6Tbk1z02ivw4UPhSaNEX+Wiu7O0v13skA@mail.gmail.com>
	<20120517055425.GA802@infradead.org> <4FC762DD.90101@FreeBSD.org>
In-Reply-To: <4FC762DD.90101@FreeBSD.org>
X-Enigmail-Version: 1.4.1
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: =?ISO-8859-1?Q?Dag-Erling_Sm=F8?=@FreeBSD.ORG,
	Adrian Chadd <adrian@FreeBSD.org>, d@delphij.net,
	Eitan Adler <lists@eitanadler.com>, freebsd-arch@FreeBSD.org,
	rgrav <des@des.no>
Subject: Re: Allow small amount of memory be mlock()'ed by unprivileged
	process?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 01 Jun 2012 01:46:10 -0000

On 5/31/2012 5:23 AM, Andriy Gapon wrote:
> In fact, FreeBSD also has this rlimit and there seems to be full support for it on
> both user and kernel sides.
> OTOH, PRIV_VM_MLOCK privilege seems to be granted only to the super-user in the
> default configuration.  And this privilege kind of defeats the limit.
> 
> Perhaps, we should/could kill the privilege and set the limit to a sufficiently
> small/safe value for ordinary users?

I like this idea, but someone else in the thread (sorry, don't have it
handy) brought up the point that we don't want the aggregate of per-user
limits to be able to bring down the system either. So the right solution
would seem to be a reasonable per-user limit, and a cap on the maximum
total amount of locked pages for all unprivileged users, probably based
on some percentage of total available memory?

Doug

-- 

    This .signature sanitized for your protection

From owner-freebsd-arch@FreeBSD.ORG  Fri Jun  1 15:41:22 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5BE221065679
	for <freebsd-arch@FreeBSD.org>; Fri,  1 Jun 2012 15:41:22 +0000 (UTC)
	(envelope-from bryan@shatow.net)
Received: from secure.xzibition.com (secure.xzibition.com [173.160.118.92])
	by mx1.freebsd.org (Postfix) with ESMTP id A98AE8FC12
	for <freebsd-arch@FreeBSD.org>; Fri,  1 Jun 2012 15:41:21 +0000 (UTC)
DomainKey-Signature: a=rsa-sha1; c=nofws; d=shatow.net; h=message-id
	:date:from:mime-version:to:cc:subject:references:in-reply-to
	:content-type; q=dns; s=sweb; b=OQz+T/sQiaxmoD9DZU2K7Wgxou2gtSX9
	6EPyS2MoR2nEHhjfLHYEtTyS9XFb8LDzPHCkI/1obWkDdGa2M7NanfWm42P9A+xF
	0EMgTbh6M2YYtG6vuRwNEIoFyuVVuGNGv2fGI2ygBIjeou0yC4QDaiS/oNivG7NK
	FbPzeHOXuEI=
DKIM-Signature: v=1; a=rsa-sha256; c=simple; d=shatow.net; h=message-id
	:date:from:mime-version:to:cc:subject:references:in-reply-to
	:content-type; s=sweb; bh=kBrTc/W9qIgsyOKMzK/PVWXkHzwP4HWcGp7Xnh
	UzNmk=; b=swaLiDY9Fyw/S2EihR7Km6i3EJDI9uFyI+o7OkIYu6w6SJ8ozOb7gS
	6nb4n+W3UZVdYzp7CcfDQ7sl3fSOYynu9aLPkiH4W8NPhvuOqa/fPrpuvIe7V8oe
	F8mjRPRZu6g9rS5Zn3wIlsWJNFi0ijBz7NxnnjZFVsfL9OnDK4JTk=
Received: (qmail 3984 invoked from network); 1 Jun 2012 10:41:17 -0500
Received: from unknown (HELO ?192.168.21.109?) (bryan@shatow.net@74.94.87.209)
	by sweb.xzibition.com with ESMTPA; 1 Jun 2012 10:41:17 -0500
Message-ID: <4FC8E29F.2010806@shatow.net>
Date: Fri, 01 Jun 2012 10:41:19 -0500
From: Bryan Drewery <bryan@shatow.net>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
	rv:12.0) Gecko/20120428 Thunderbird/12.0.1
MIME-Version: 1.0
To: Doug Barton <dougb@FreeBSD.org>
References: <4FAC3EAB.6050303@delphij.net> <861umkurt8.fsf@ds4.des.no>
	<CAJ-VmokY+pgcq999NHShbq-3rK3=oeWT2WY7NmTvVdXOHZJhdg@mail.gmail.com>
	<CAF6rxgmDW21aPJ5Mp6Tbk1z02ivw4UPhSaNEX+Wiu7O0v13skA@mail.gmail.com>
	<20120517055425.GA802@infradead.org>
	<4FC762DD.90101@FreeBSD.org> <4FC81D9C.2080801@FreeBSD.org>
In-Reply-To: <4FC81D9C.2080801@FreeBSD.org>
X-Enigmail-Version: 1.4.1
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature";
	boundary="------------enigB651918E900EB354EF176708"
Cc: =?ISO-8859-1?Q?Dag-Erling_Sm=F8?=@FreeBSD.ORG,
	Adrian Chadd <adrian@FreeBSD.org>, d@delphij.net,
	Andriy Gapon <avg@FreeBSD.org>,
	Eitan Adler <lists@eitanadler.com>, freebsd-arch@FreeBSD.org,
	rgrav <des@des.no>
Subject: Re: Allow small amount of memory be mlock()'ed by unprivileged
	process?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 01 Jun 2012 15:41:22 -0000

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)
--------------enigB651918E900EB354EF176708
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

On 5/31/2012 8:40 PM, Doug Barton wrote:
> On 5/31/2012 5:23 AM, Andriy Gapon wrote:
>> In fact, FreeBSD also has this rlimit and there seems to be full suppo=
rt for it on
>> both user and kernel sides.
>> OTOH, PRIV_VM_MLOCK privilege seems to be granted only to the super-us=
er in the
>> default configuration.  And this privilege kind of defeats the limit.
>>
>> Perhaps, we should/could kill the privilege and set the limit to a suf=
ficiently
>> small/safe value for ordinary users?
>=20
> I like this idea, but someone else in the thread (sorry, don't have it
> handy) brought up the point that we don't want the aggregate of per-use=
r
> limits to be able to bring down the system either. So the right solutio=
n
> would seem to be a reasonable per-user limit, and a cap on the maximum
> total amount of locked pages for all unprivileged users, probably based=

> on some percentage of total available memory?
>=20
> Doug
>=20

I like this approach. A per-user ulimit, and a global max sysctl that
can be overridden, but by default based on a percentage of available memo=
ry.

--=20
Regards,
Bryan Drewery


--------------enigB651918E900EB354EF176708
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQIcBAEBAgAGBQJPyOKjAAoJEG54KsA8mwz5DBcP/2Z14YhYXnsnl2h3yAIcrB04
89cEVNWqeSaRhRrenGkTDI3qhpzd19D/huugd50YT9L+HJUehmBbL8kL0a6tc0KF
8COlXldFOWL1v3TmXgbkirE9+eEp1AoGh/f/SiKDBLPwufLMOO/NMvElPSgkofV1
sVFy56824PELgaK0aUeqNYSM+VzCGlgetVCJuyBSs6TguBIp21A9/W+UIfRb3ZLI
mdVIjhZyzHMzFz8PbdSkVv7PMoCW/hEhHELDZTgiVShX7UjbE7rTmOQoOILPgv/B
xPgUv6FdSD3OkRBy1v0TXunnj8ztdolEU0rpkBQASFI0meoYcAnh9ixvLZESK9Rt
remsIzaynZOqnOfATuPT9ukehf52Yz1O2qTH148H9Ija9+V0gI0n0SpXnu4RHQ92
fCwGHGNq0yw1LmvzA1qWPRRXc+RcVERowPLA0ILCwCwtUFBUnymy4qdZsmJyNLZ7
SpB5DMTM6vB9eiUrOGdFUfh/xqQDNcMJcPuWlUTHrzHADkKe+Qch4QhIg7q5shBK
46a5BT4IFeEqjNZuNZm/jfF7FsIPcCweerwHpM46d12COj2iglMgy/BFuuBmVjSJ
jtfltEZI3FmCfIZOWzZfbnDhreVdE+ATESD49PKOyINDv7K2UvMfrg5O7ywSY61n
m5goeGBBQ7E5suuiniGj
=jY8i
-----END PGP SIGNATURE-----

--------------enigB651918E900EB354EF176708--

From owner-freebsd-arch@FreeBSD.ORG  Fri Jun  1 17:53:17 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 9E818106566B;
	Fri,  1 Jun 2012 17:53:17 +0000 (UTC)
	(envelope-from giovanni.trematerra@gmail.com)
Received: from mail-qc0-f182.google.com (mail-qc0-f182.google.com
	[209.85.216.182])
	by mx1.freebsd.org (Postfix) with ESMTP id 03AA58FC15;
	Fri,  1 Jun 2012 17:53:16 +0000 (UTC)
Received: by qcsg15 with SMTP id g15so1553323qcs.13
	for <multiple recipients>; Fri, 01 Jun 2012 10:53:16 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:date:message-id:subject:from:to:cc:content-type;
	bh=8/It4euG+yHm4eSPneqe8eBPwiZ6874h6vDAXdbNymI=;
	b=t2vR3lmjLV37of+Db1iJPqKztI50KWnG/QRrjd2HodnbJfz+dwqvDVa2Mlews3J8L+
	wgWfFeJqGOB/2mjX/e+HVydi429ZkjzXKkmfrt/li1o5sT+5Zp/ylxJTMPANymCtWIVR
	mASMAbo6Pp8PCOzSbLD9zvaJN35S6nBUPNzjcRpulOC3HxjeJwWikuqND2CCgyJGBu+V
	6O6Dp3/WKgli55pvzLu+Tg1W06Wc5mYUzHtnR6PhRG5cO0Ia1NGwFzC068TRJffM9nI+
	THnHv5hBPwkL0vANmoe/8rLcqsgpM6lKHIZ+JmKaJ8poE/Y9HCgjoTdeHGpJ3qJB4V4x
	jxag==
MIME-Version: 1.0
Received: by 10.224.184.82 with SMTP id cj18mr3089654qab.81.1338573195923;
	Fri, 01 Jun 2012 10:53:15 -0700 (PDT)
Received: by 10.229.160.20 with HTTP; Fri, 1 Jun 2012 10:53:15 -0700 (PDT)
Date: Fri, 1 Jun 2012 19:53:15 +0200
Message-ID: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>
From: Giovanni Trematerra <giovanni.trematerra@gmail.com>
To: freebsd-arch@freebsd.org
Content-Type: text/plain; charset=ISO-8859-1
Cc: Attilio Rao <attilio@freebsd.org>, alc@freebsd.org,
	Konstantin Belousov <kib@freebsd.org>, Alexander Kabaev <kan@freebsd.org>
Subject: [RFC] Kernel shared variables
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 01 Jun 2012 17:53:17 -0000

Hello,
I'd like to discuss a way to provide a mechanism to share some read-only
data between kernel and user space programs avoiding syscall overhead,
implementing some them, such as gettimeofday(3) and time(3) as ordinary
user space routine.

The patch at
http://www.trematerra.net/patches/ksvar_experimental.patch

is in a very experimental stage. It's just a proof-of-concept.
Only works for an AMD64 kernel and only for 64-bit applications.
The idea is to have all the variables that we want to share between kernel
and user space into one or more consecutive pages of memory that will be
mapped read-only into every running process. At the start of the first
shared page
there'll be a table with as many entries as the number of the shared variables.
Each entry is a 32-bit value that is the offset between the start of the shared
page and the start of the variable in the page. The user space processes need
to find out the map address of shared page and use the table to access to the
shared variables.
Kernel will export a variable to user space as an index, so user space code
must refer to a specific index to access a kernel shared variable.
Let's take a quick look to the KPI/API for exporting/importing kernel
shared variables.
Say we want implement a routine to export an int from the kernel.
To define the variable to be exported inside the kernel you would use

KSVAR_DEFINE(0, int, test_value);

You have just defined an int variable named "test_value" at index 0.
Inside the kernel you can write/read as usual using the symbol test_value;
Now you likely want add to libc a function callable from user processes
that return the test_value variable. So first of all you need the import the
variable.

KSVAR_IMPORT(0, int, test_value);

and to obtain a pointer to read the value you would use

KSVAR(test_value);

so your function would look like something like this

int get_test_value()
{

     return (*KSVAR(test_value));
}

Then inside your process just call get_test_value() function as you usually
do and you'll get a kernel written value without switching in kernel mode.

Let's see now in more detail how that could be accomplished.
The shared variables will be accessed as normal variables and are read/write
inside the kernel. The variables need to be inside the same page(s) and nothing
but the shared variables (and the table) must be into the page(s). To
obtain that
I changed the linker script in this way

--- a/sys/conf/ldscript.amd64
+++ b/sys/conf/ldscript.amd64
@@ -177,6 +177,15 @@ SECTIONS
    *(.ldata .ldata.* .gnu.linkonce.l.*)
    . = ALIGN(. != 0 ? 64 / 8 : 1);
  }
+  .ksvar ALIGN(CONSTANT (COMMONPAGESIZE)) :
+  {
+    __ksvar_set_start = .;
+    *(.ksvar_table)
+    *(.ksvar)
+
+   . = ALIGN(CONSTANT (COMMONPAGESIZE));
+   __ksvar_set_stop = .;
+  }
  . = ALIGN(64 / 8);
  _end = .; PROVIDE (end = .);
  . = DATA_SEGMENT_END (.);

When we want to define a variable in the kernel to share with user space
we have to use KSVAR_DEFINE macro in sys/sys/ksvar.h

+struct ksvar_set {
+       uint32_t idx;
+       char *pksvar;
+};
+
+/*
+ * Declare a variable into kernel shared linker_set.
+ */
+#define        KSVAR_DEFINE(index, type, name) \
+       static type name __section(".ksvar");                   \
+       static struct ksvar_set name ## _ksvar_set = {          \
+               .idx = index,                                   \
+               .pksvar = (char *) &name                        \
+       };                                                      \
+       DATA_SET(ksvar_set, name ## _ksvar_set)

Every variable must have a unique index. The indexes must
start from zero and be consecutive. When you add an index
you must bump the size of the table (KSVAR_TABLE_SIZE)
(see sys/sys/ksvar.h)

The variables are inside the kernel static image that isn't managed
by the VM and so we need to allocate pages to map the physical addresses.
A new SYSINIT (ksvarinit) will allocate a set of vm_page_t  through
the vm_phys_fictitious_reg_range interface and fill the table using
the information
of the ksvar_set linker set, then will create a vm_object_t (vm_object_ksvar),
mark the fake pages as valid and put them into it.
When a new process is created by exec(3) the vm_object_ksvar will be
mapped read-only into the process address space by vm_map_fixed routine
just before mapping the user stack. The address of mapping will be recorded
inside the new p_ksvar field of the struct proc.
This field will be exported through a sysctl to the user space processes.
In order to implement syscalls as user space routines, we have to find out the
mapped address of the kernel shared variables when the libc is mapped into
the process. So I added a function marked with the attribute constructor.
It will called before any code into user process and before any code inside
the libc.

+__attribute((constructor)) void init_kernel_shared()
+{
+       int mib[2];
+       size_t len;
+       vm_offset_t ksvar_address;
+
+       mib[0] = CTL_KERN;
+       mib[1] = KERN_KSVAR;
+       len = sizeof(vm_offset_t);
+       if (__sysctl(mib, 2, (void *) &ksvar_address, &len, NULL, 0) != -1)
+               ksvar_table = (uint32_t *) ksvar_address;
+}

Once the libc knows the address of the table it can access to the shared
variables.

Just as proof of concept I re-implemented gettimeofday(3) in user space.
First of all I didn't remove the entry into the syscall.master, just renamed the
sys_gettimeofday. I need it for the fallback path.
In the kernel I introduced a struct wall_clock.

+struct wall_clock
+{
+       struct timeval  tv;
+       struct timezone tz;
+};

The struct is exported through sys/sys/time.h header.
I defined a new kernel shared variable. To do so I added an index in
sys/sys/ksvar.h
WALL_CLOCK_INDEX and bumped KSVAR_TABLE_SIZE to 1.
In the sys/kern/kern_clocksource.c

+/* kernel shared variable for implmenting gettimeofday. */
+KSVAR_DEFINE(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);

Now we defined a shared variable at index WALL_CLOCK_INDEX of type
struct wall_clock and named wall_clock.
Inside handleevents I update the info exported by wall_clock.

+       struct timeval tv;
+
+       /* update time for userspace gettimeofday */
+       microtime(&tv);
+       wall_clock.tv = tv;
+       wall_clock.tz.tz_minuteswest = tz_minuteswest;
+       wall_clock.tz.tz_dsttime = tz_dsttime;

Now, in libc we import the shared variable

+KSVAR_IMPORT(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);

note that WALL_CLOCK_INDEX must be the same of the one defined
inside the kernel, and define a new function gettimeofday

+int
+gettimeofday(struct timeval *tp, struct timezone *tzp)
+{
+
+       /* fallback to syscall if kernel doesn't export ksvar */
+       if (!KSVAR_IS_ACTIVE())
+               return (sys_gettimeofday(tp, tzp));
+
+       if (tp != NULL)
+               *tp = KSVAR(wall_clock)->tv;
+       if (tzp != NULL)
+               *tzp = KSVAR(wall_clock)->tz;
+       return (0);
+}

Now when a process will call getimeofday, will call that function actually.
If the process makes a lot of call to gettimeofday, we will see a
performance boost.
Note that if ksvar are not exported from the kernel (KSVAR_IS_ACTIVE),
the function
fallback to call the actual syscall (sys_gettimeofday).

Open tasks
- implement support for 32-bit emulated processes running in a 64-bit
environment.
- extend support to others arch
- implement more syscalls
- benchmarks
- Test, test, test.

I'm looking forward to hear about your comments and suggestions.

--
Gianni

From owner-freebsd-arch@FreeBSD.ORG  Fri Jun  1 19:22:30 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1DEA3106566B;
	Fri,  1 Jun 2012 19:22:30 +0000 (UTC)
	(envelope-from lev@serebryakov.spb.ru)
Received: from onlyone.friendlyhosting.spb.ru (onlyone.friendlyhosting.spb.ru
	[IPv6:2a01:4f8:131:60a2::2])
	by mx1.freebsd.org (Postfix) with ESMTP id A761A8FC18;
	Fri,  1 Jun 2012 19:22:29 +0000 (UTC)
Received: from lion.home.serebryakov.spb.ru (unknown
	[IPv6:2001:470:923f:1:756c:80e7:ffb9:a0c4])
	(Authenticated sender: lev@serebryakov.spb.ru)
	by onlyone.friendlyhosting.spb.ru (Postfix) with ESMTPA id 1E1304AC2D; 
	Fri,  1 Jun 2012 23:22:26 +0400 (MSK)
Date: Fri, 1 Jun 2012 23:22:20 +0400
From: Lev Serebryakov <lev@serebryakov.spb.ru>
X-Priority: 3 (Normal)
Message-ID: <681265513.20120601232220@serebryakov.spb.ru>
To: Giovanni Trematerra <giovanni.trematerra@gmail.com>
In-Reply-To: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>
References: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=windows-1251
Content-Transfer-Encoding: quoted-printable
Cc: Attilio Rao <attilio@freebsd.org>, alc@freebsd.org,
	Konstantin Belousov <kib@freebsd.org>,
	Alexander Kabaev <kan@freebsd.org>, freebsd-arch@freebsd.org
Subject: Re: [RFC] Kernel shared variables
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 01 Jun 2012 19:22:30 -0000

Hello, Giovanni.
You wrote 1 =E8=FE=ED=FF 2012 =E3., 21:53:15:

GT> I'm looking forward to hear about your comments and suggestions.
   It is great, that you start this work!

   This approach was discussed several times according to my memory, as
 way to make cheap sysclass like gettimeofday() (because some
 Linux-orientet programs like to call them very often and have them
 cheap is very good idea), and every time conclusion was, that it is
 very good approach, but without any resulting code.


--=20
// Black Lion AKA Lev Serebryakov <lev@serebryakov.spb.ru>


From owner-freebsd-arch@FreeBSD.ORG  Fri Jun  1 19:35:30 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id A8B9D1065741;
	Fri,  1 Jun 2012 19:35:30 +0000 (UTC)
	(envelope-from kostikbel@gmail.com)
Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200])
	by mx1.freebsd.org (Postfix) with ESMTP id CAF758FC17;
	Fri,  1 Jun 2012 19:35:29 +0000 (UTC)
Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1])
	by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q51JZMlU093011;
	Fri, 1 Jun 2012 22:35:22 +0300 (EEST)
	(envelope-from kostikbel@gmail.com)
Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1])
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id
	q51JZMFW056808; Fri, 1 Jun 2012 22:35:22 +0300 (EEST)
	(envelope-from kostikbel@gmail.com)
Received: (from kostik@localhost)
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q51JZMsi056807; 
	Fri, 1 Jun 2012 22:35:22 +0300 (EEST)
	(envelope-from kostikbel@gmail.com)
X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to
	kostikbel@gmail.com using -f
Date: Fri, 1 Jun 2012 22:35:22 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Giovanni Trematerra <giovanni.trematerra@gmail.com>
Message-ID: <20120601193522.GA2358@deviant.kiev.zoral.com.ua>
References: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="SRK8lRENmpuaYFQC"
Content-Disposition: inline
In-Reply-To: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>
User-Agent: Mutt/1.4.2.3i
X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00
	autolearn=ham version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on
	skuns.kiev.zoral.com.ua
Cc: Attilio Rao <attilio@freebsd.org>, alc@freebsd.org,
	Alexander Kabaev <kan@freebsd.org>, freebsd-arch@freebsd.org
Subject: Re: [RFC] Kernel shared variables
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 01 Jun 2012 19:35:30 -0000


--SRK8lRENmpuaYFQC
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Fri, Jun 01, 2012 at 07:53:15PM +0200, Giovanni Trematerra wrote:
> Hello,
> I'd like to discuss a way to provide a mechanism to share some read-only
> data between kernel and user space programs avoiding syscall overhead,
> implementing some them, such as gettimeofday(3) and time(3) as ordinary
> user space routine.
>=20
> The patch at
> http://www.trematerra.net/patches/ksvar_experimental.patch
>=20
> is in a very experimental stage. It's just a proof-of-concept.
> Only works for an AMD64 kernel and only for 64-bit applications.
> The idea is to have all the variables that we want to share between kernel
> and user space into one or more consecutive pages of memory that will be
> mapped read-only into every running process. At the start of the first
> shared page
> there'll be a table with as many entries as the number of the shared vari=
ables.
> Each entry is a 32-bit value that is the offset between the start of the =
shared
> page and the start of the variable in the page. The user space processes =
need
> to find out the map address of shared page and use the table to access to=
 the
> shared variables.
> Kernel will export a variable to user space as an index, so user space co=
de
> must refer to a specific index to access a kernel shared variable.
> Let's take a quick look to the KPI/API for exporting/importing kernel
> shared variables.
> Say we want implement a routine to export an int from the kernel.
> To define the variable to be exported inside the kernel you would use
>=20
> KSVAR_DEFINE(0, int, test_value);
>=20
> You have just defined an int variable named "test_value" at index 0.
> Inside the kernel you can write/read as usual using the symbol test_value;
> Now you likely want add to libc a function callable from user processes
> that return the test_value variable. So first of all you need the import =
the
> variable.
>=20
> KSVAR_IMPORT(0, int, test_value);
>=20
> and to obtain a pointer to read the value you would use
>=20
> KSVAR(test_value);
>=20
> so your function would look like something like this
>=20
> int get_test_value()
> {
>=20
>      return (*KSVAR(test_value));
> }
>=20
> Then inside your process just call get_test_value() function as you usual=
ly
> do and you'll get a kernel written value without switching in kernel mode.
>=20
> Let's see now in more detail how that could be accomplished.
> The shared variables will be accessed as normal variables and are read/wr=
ite
> inside the kernel. The variables need to be inside the same page(s) and n=
othing
> but the shared variables (and the table) must be into the page(s). To
> obtain that
> I changed the linker script in this way
>=20
> --- a/sys/conf/ldscript.amd64
> +++ b/sys/conf/ldscript.amd64
> @@ -177,6 +177,15 @@ SECTIONS
>     *(.ldata .ldata.* .gnu.linkonce.l.*)
>     . =3D ALIGN(. !=3D 0 ? 64 / 8 : 1);
>   }
> +  .ksvar ALIGN(CONSTANT (COMMONPAGESIZE)) :
> +  {
> +    __ksvar_set_start =3D .;
> +    *(.ksvar_table)
> +    *(.ksvar)
> +
> +   . =3D ALIGN(CONSTANT (COMMONPAGESIZE));
> +   __ksvar_set_stop =3D .;
> +  }
>   . =3D ALIGN(64 / 8);
>   _end =3D .; PROVIDE (end =3D .);
>   . =3D DATA_SEGMENT_END (.);
>=20
> When we want to define a variable in the kernel to share with user space
> we have to use KSVAR_DEFINE macro in sys/sys/ksvar.h
>=20
> +struct ksvar_set {
> +       uint32_t idx;
> +       char *pksvar;
> +};
> +
> +/*
> + * Declare a variable into kernel shared linker_set.
> + */
> +#define        KSVAR_DEFINE(index, type, name) \
> +       static type name __section(".ksvar");                   \
> +       static struct ksvar_set name ## _ksvar_set =3D {          \
> +               .idx =3D index,                                   \
> +               .pksvar =3D (char *) &name                        \
> +       };                                                      \
> +       DATA_SET(ksvar_set, name ## _ksvar_set)
>=20
> Every variable must have a unique index. The indexes must
> start from zero and be consecutive. When you add an index
> you must bump the size of the table (KSVAR_TABLE_SIZE)
> (see sys/sys/ksvar.h)
>=20
> The variables are inside the kernel static image that isn't managed
> by the VM and so we need to allocate pages to map the physical addresses.
> A new SYSINIT (ksvarinit) will allocate a set of vm_page_t  through
> the vm_phys_fictitious_reg_range interface and fill the table using
> the information
> of the ksvar_set linker set, then will create a vm_object_t (vm_object_ks=
var),
> mark the fake pages as valid and put them into it.
> When a new process is created by exec(3) the vm_object_ksvar will be
> mapped read-only into the process address space by vm_map_fixed routine
> just before mapping the user stack. The address of mapping will be record=
ed
> inside the new p_ksvar field of the struct proc.
> This field will be exported through a sysctl to the user space processes.
> In order to implement syscalls as user space routines, we have to find ou=
t the
> mapped address of the kernel shared variables when the libc is mapped into
> the process. So I added a function marked with the attribute constructor.
> It will called before any code into user process and before any code insi=
de
> the libc.
>=20
> +__attribute((constructor)) void init_kernel_shared()
> +{
> +       int mib[2];
> +       size_t len;
> +       vm_offset_t ksvar_address;
> +
> +       mib[0] =3D CTL_KERN;
> +       mib[1] =3D KERN_KSVAR;
> +       len =3D sizeof(vm_offset_t);
> +       if (__sysctl(mib, 2, (void *) &ksvar_address, &len, NULL, 0) !=3D=
 -1)
> +               ksvar_table =3D (uint32_t *) ksvar_address;
> +}
>=20
> Once the libc knows the address of the table it can access to the shared
> variables.
>=20
> Just as proof of concept I re-implemented gettimeofday(3) in user space.
> First of all I didn't remove the entry into the syscall.master, just rena=
med the
> sys_gettimeofday. I need it for the fallback path.
> In the kernel I introduced a struct wall_clock.
>=20
> +struct wall_clock
> +{
> +       struct timeval  tv;
> +       struct timezone tz;
> +};
>=20
> The struct is exported through sys/sys/time.h header.
> I defined a new kernel shared variable. To do so I added an index in
> sys/sys/ksvar.h
> WALL_CLOCK_INDEX and bumped KSVAR_TABLE_SIZE to 1.
> In the sys/kern/kern_clocksource.c
>=20
> +/* kernel shared variable for implmenting gettimeofday. */
> +KSVAR_DEFINE(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);
>=20
> Now we defined a shared variable at index WALL_CLOCK_INDEX of type
> struct wall_clock and named wall_clock.
> Inside handleevents I update the info exported by wall_clock.
>=20
> +       struct timeval tv;
> +
> +       /* update time for userspace gettimeofday */
> +       microtime(&tv);
> +       wall_clock.tv =3D tv;
> +       wall_clock.tz.tz_minuteswest =3D tz_minuteswest;
> +       wall_clock.tz.tz_dsttime =3D tz_dsttime;
>=20
> Now, in libc we import the shared variable
>=20
> +KSVAR_IMPORT(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);
>=20
> note that WALL_CLOCK_INDEX must be the same of the one defined
> inside the kernel, and define a new function gettimeofday
>=20
> +int
> +gettimeofday(struct timeval *tp, struct timezone *tzp)
> +{
> +
> +       /* fallback to syscall if kernel doesn't export ksvar */
> +       if (!KSVAR_IS_ACTIVE())
> +               return (sys_gettimeofday(tp, tzp));
> +
> +       if (tp !=3D NULL)
> +               *tp =3D KSVAR(wall_clock)->tv;
> +       if (tzp !=3D NULL)
> +               *tzp =3D KSVAR(wall_clock)->tz;
> +       return (0);
> +}
>=20
> Now when a process will call getimeofday, will call that function actuall=
y.
> If the process makes a lot of call to gettimeofday, we will see a
> performance boost.
> Note that if ksvar are not exported from the kernel (KSVAR_IS_ACTIVE),
> the function
> fallback to call the actual syscall (sys_gettimeofday).
>=20
> Open tasks
> - implement support for 32-bit emulated processes running in a 64-bit
> environment.
> - extend support to others arch
> - implement more syscalls
> - benchmarks
> - Test, test, test.
>=20
> I'm looking forward to hear about your comments and suggestions.

I very much dislike what you described, it makes ABI maintanence
a nightmare.
Below is some mail I wrote around Spring 2009, making some notes about
desired proposal. This is what called vdso in Linux land.


On Tue, Mar 31, 2009 at 04:04:46PM +0200, Giuseppe Cocomazzi wrote:
> Gentle kib,
> I've understood what you mean to do: you said me to implement the=20
> syscall trampoline as a dynamic shared object to be copied by the kernel=
=20
> in every process shared page; then we would eventually pass the shared=20
> page address to the rtld using a AT_SYSINFO_EHDR. During program=20
> startup, if this is found by the dyn linker, we define a symbol=20
> containing that previously obtained address, which libc could easily
> access for its own syscall wrappers. Ok? Is this your idea? Or didn't I=
=20
> get it at all?
> Now, if I got what you meant, let me explain my already done work on the=
=20
> syscall trampoline.
> My approach does not make use of any dso: the kernel just copies a=20
> little piece of code in the syscall trampoline shared page:
> 	popl	%ecx
> 	int	$0x80
> 	pushl	%ecx
> without any symbol. This would be changed in its sysenter counterpart by=
=20
> cpu_startup, in case the SEP bit is set. A sysctl has been created in=20
> order to let user programs obtain that sc trampoline address.
> Crt has been patched to retrieve the address by means of the sysctl at=20
> run time and then puts this address in a global symbol named 'sctramp'.
> (I want you to know that the sysctl mechanism could be simply avoided if=
=20
>   we decide to have syscall shared pages at a fixed address: actually=20
> they are mapped to maxsaddr. This need to be discussed later, but is not=
=20
> the hot point, now.)
> The 'sctramp' symbol is accessed by libc wrappers to enter the kernel=20
> when issuing system calls:
> 	#define KERNCALL	call *%sctramp
> What I want to say is the following: I think your approach is the same=20
> as mine, in that the rtld has to load a shared object in any case, being=
=20
> it crt or a custom dso. But since crt is already there, why do we need=20
> to create another .so when we already have one which is however linked=20
> in the process address space? Think of crt as your custom dso, and=20
> you'll get the picture.
> Maybe, your approach is more elegant than mine, though mine is more=20
> minimal and less invasive (no symbols to take into account but one,=20
> etc). Furthermore, don't understimate the fact that I've already coded=20
> and tested it: I attach a copy of the whole patch, so that you can have=
=20
> a look at it, accompanied by a little explaining paper. Don't waste your=
=20
> time in reviewing the kernel part (this is rookie's task), concentrate=20
> on the user space part.
> Hope I don't cause a waste of your precious time,
> Regards

Crt is not dso. It is the stub that got linked statically to most
binaries. The actual mechanism you implemented is _ortohonal_ to
decision of having shared page as a dso.

That dso shall not be used to provide any "addresses" to libc. Libc
syscall stabs shall call some functional symbol, that is defined strong
in the dso, and weak in the rtld. Rtld implementation shall be int0x80.
Rtld shall preload the dso, assuming the aux entry supplied by the
kernel contains phdr address of the object.

Features that gives us the dso:
1. Absolute freedom in the layout of the page.
2. Page may implement several entries, among them are
	- syscall (that is what you described above);
	- gettimeofday with optimized implementation (see long threads
		about TSC, APCI HPET etc);
	- getpid
	- machine-optimized copy routines like memcpy, strcpy and so on.
	- signal trampoline code (see #5 below, why this is _very_
		desirable).
	- ... (there I have stopped my imagination)
3. Addition of new symbols does not require any changes to libc to activate
   them, because the standard behaviour of the dynamic linker gives the
   priority to the symbols from the preloaded objects over the symbols
   from the dependencies.
4. Dso gives the right place for the CFI to be found by debuggers and
   exception propagation code (CFI stands for Call Frame Information,
   it is used to allow the stack unwinding to properly restore frames
   and registers). amd64 already suffers from the lack of CFI on signal
   trampolines and sysenter wrappers. Bare shared page is ugly from
   this point of view. Need for CFI was one of the main motivation
   for the dso on Linux.
5. Putting signal trampoline into the shared page instead of top of the
   stack would be a great step into enabling NX bit for the stack.
6. Linuxolator would get the vdso too, that is big deficiency in it now.

As you see, list of the items that are desirable on the shared page
is quite long, and having fixed format is the large problem for
binary compatibility.


--SRK8lRENmpuaYFQC
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (FreeBSD)

iEYEARECAAYFAk/JGXkACgkQC3+MBN1Mb4iqcQCeKC+6UcscqSD0AkKnVu1QPiTu
VrUAoI0hxz1U92+l9Ka0acuRJXg42AV5
=8QvG
-----END PGP SIGNATURE-----

--SRK8lRENmpuaYFQC--

From owner-freebsd-arch@FreeBSD.ORG  Fri Jun  1 21:21:55 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id CB19D106564A;
	Fri,  1 Jun 2012 21:21:55 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail06.syd.optusnet.com.au (mail06.syd.optusnet.com.au
	[211.29.132.187])
	by mx1.freebsd.org (Postfix) with ESMTP id 2BAE98FC0C;
	Fri,  1 Jun 2012 21:21:55 +0000 (UTC)
Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au
	(c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232])
	by mail06.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q51LLjQW002570
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sat, 2 Jun 2012 07:21:46 +1000
Date: Sat, 2 Jun 2012 07:21:45 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Giovanni Trematerra <giovanni.trematerra@gmail.com>
In-Reply-To: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>
Message-ID: <20120602044306.S4049@besplex.bde.org>
References: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Attilio Rao <attilio@freebsd.org>, alc@freebsd.org,
	Konstantin Belousov <kib@freebsd.org>,
	Alexander Kabaev <kan@freebsd.org>, freebsd-arch@freebsd.org
Subject: Re: [RFC] Kernel shared variables
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 01 Jun 2012 21:21:55 -0000

On Fri, 1 Jun 2012, Giovanni Trematerra wrote:

> I'd like to discuss a way to provide a mechanism to share some read-only
> data between kernel and user space programs avoiding syscall overhead,
> implementing some them, such as gettimeofday(3) and time(3) as ordinary
> user space routine.

This is particularly unsuitable for implementing gettimeofday(), since for
it to work you would need to use approximately 1 CPU spinning in the
kernel to update the time every microsecond.  For time(3), it only needs
a relatively slow update.  For clock_gettime() with nansoeconds precision,
it is even more unsuitable.  For clock_gettime() with precisions between
1 second and 1 microseconds, it is intermediately unsuitable.

It also requires some complications for locking/atomicity and coherency
(much the same as in the kernel.  Not just for times.  For times, the
kernel handles locking/atomicity fairly well, and coherency fairly badly.

> The patch at
> http://www.trematerra.net/patches/ksvar_experimental.patch
>
> is in a very experimental stage. It's just a proof-of-concept.
> Only works for an AMD64 kernel and only for 64-bit applications.
> The idea is to have all the variables that we want to share between kernel
> and user space into one or more consecutive pages of memory that will be
> mapped read-only into every running process. At the start of the first
> shared page
> there'll be a table with as many entries as the number of the shared variables.
> Each entry is a 32-bit value that is the offset between the start of the shared
> page and the start of the variable in the page. The user space processes need
> to find out the map address of shared page and use the table to access to the
> shared variables.

On amd64, 2 32-bit values or 64-bit values with most bits 0 or 1 can be
packed/encoded into 1 64-bit value to give a certain atomicity without
locking.  The corresponding i386 packing into 1 32-bit value doesn't work
so well.

> ...

> Just as proof of concept I re-implemented gettimeofday(3) in user space.
> First of all I didn't remove the entry into the syscall.master, just renamed the
> sys_gettimeofday. I need it for the fallback path.
> In the kernel I introduced a struct wall_clock.
>
> +struct wall_clock
> +{
> +       struct timeval  tv;
> +       struct timezone tz;
> +};

This is much larger than 64 bits.  struct timezone is relatively unimportant.
struct timeval is bloated on amd64 (128 bits), but can be packed into 64
bits (works for a few hundred years).  On i386, it could be packed into
20 bits for tv_usec and 12 bits for an offset for tv_sec.

> Now we defined a shared variable at index WALL_CLOCK_INDEX of type
> struct wall_clock and named wall_clock.
> Inside handleevents I update the info exported by wall_clock.
>
> +       struct timeval tv;
> +
> +       /* update time for userspace gettimeofday */
> +       microtime(&tv);

This is supposed to have no races (microtime() uses a generation counter
to recover from them), and gives the necessary microseconds precision,
provided it is called every microsecond.  This doesn't quite require
a CPU spinning to update.  You could also unmap the variable 1
microsecond after it is accessed and take a pagefault to update it
for the next access.

> +       wall_clock.tv = tv;

This has races (when userland accesses it in the middle of the kernel
update).

> +       wall_clock.tz.tz_minuteswest = tz_minuteswest;
> +       wall_clock.tz.tz_dsttime = tz_dsttime;

This has races, but rarely changes.

> +int
> +gettimeofday(struct timeval *tp, struct timezone *tzp)
> +{
> +
> +       /* fallback to syscall if kernel doesn't export ksvar */
> +       if (!KSVAR_IS_ACTIVE())
> +               return (sys_gettimeofday(tp, tzp));
> +
> +       if (tp != NULL)
> +               *tp = KSVAR(wall_clock)->tv;

Races.  These can be fixed using a generation counter as in the kernel,
but it is much harder since we are not in control of the update times
and the updates must occur much more frequently.  In the kernel, the
critical updates occur only every 1-10 msec (except when timers are
stopped).  This gives an offset, and a difference is added to this,
with atomicity for the difference given even naturally by the hardware
or by locking the hardware.  Then generation count changes only every
1-10 msec, and itself is updated only by implicit serialization
instructions.  No ordering is enforced for stores and loads, but the
implicit serializations have 1-10 msec to flush the stores and loads,
in any order provided that they happen during this time, for the
generation count to work.

> +       if (tzp != NULL)
> +               *tzp = KSVAR(wall_clock)->tz;

Minor races.

> +       return (0);
> +}

I don't see how to implement gettimeofday() in userland without duplicating
most of the kernel's timecounter code.  The kernel can keep an offset,
updated every 1-10 msec.  Userland calls the hardware timecounter and
scales it delicately as in the kernel, but with even more complications
to synchronize with ntpd micro-adjustments and other threads and
timecounter hardware.

> Now when a process will call getimeofday, will call that function actually.
> If the process makes a lot of call to gettimeofday, we will see a
> performance boost.

Without the above problems being solved, we would see it not working
precisely when it is used a lot.  It would report time differences of
0 (or occasionally garbage from races) for events separated by hundreds
of thousands of nanoseconds.  gettimeofday()'s precision and accuracy
are well established, so there must be many programs that depend on
them.  The newer clock_gettime() interfaces give more choice and must
be used if sub-microsecond precision is wanted and sub-microsecond
accuracy is hoped for (with normal X86 hardware, an accuracy of ~10
nanosecond is nearly possible for small differences in relative times
but not for absolute times).  But again, the low precision clock ids
aren't very useful, since if you actually use them enough to notice
their slowness, then they won't distinguish different events.  These
ids are especially bogus when clock_gettime() is implemented as a
syscall, since the syscall overhead is about 90% of the total overhead
provided the timecounter hardware is efficient
(clock_getttime(CLOCK_MONOTONIC, ...) might take 300 nsec, while
clock_gettime(CLOCK_MONOTONIC_FAST_N_BROKEN, ...) takes 330 nsec.  Then
you may as well use precise version).  But if
clock_gettime(CLOCK_MONOTONIC, ...) is a syscall as it needs to be for
good precision, while
clock_gettime(CLOCK_MONOTONIC_FAST_N_BROKEN, ...) is memory mapped, then
the latter becomes almost useful.

BTW, the FAST_N_BROKEN clock ids advertise bogus precision in
clock_getres(), so you can't tell (except from their names) how much
worse their precision is than that of the unbroken versions.
   (clock_getres() is bogusly named (POSIX standard), since it returns
   the precision (the resolution is 1 nsec for all timespec
   interfaces).  POSIX clearly doesn't intend for clock_getres() to
   actually return the resolution, since that is known at compile time
   and parts of the specification of clock_getres() require precision
   semantics.  Many parts are confused in other ways, at least in 2001
   draft7, and are not implemented in FreeBSD.  E.g., clock_settime()
   is specified to truncate the time to a multiple of the "resolution"
   returned by clock_getres() with the same id.  This more or less
   assumes that the clock is in hardware ticks, with no
   micro-adjustments.  FreeBSD does micro-adjustments and delicate
   scaling of hardware clocks, with a low-level resolution of 2**-64
   seconds, and doesn't truncate the time in clock_settime().  File
   times are more interesting, since they often have to be truncated
   to fit on disks; POSIX only started specifying the details of this
   about 4 years ago.  In this mail, I try to use "precision"
   consistently instead of "resolution", except where describing POSIX
   bugs.)
For old clock ids, FreeBSD returns something reasonable (the hardware
or virtual clock update period, rounded up), except for CLOCK_PROF and
CLOCK_VIRTUAL, it returns the period of the unrelated hz clock (the
actual clock is an impossible-to-express combination of the stathz
clock for sampling the user+sys decomposition, the cputick clock for
the total time, and these limited by the resolution of the
getrusage()/calcru() timeval-based API,  1 usec for the tiemval
resolution is probably best on fast machine, but I use 1/stathz
seconds.)  For newer clock ids, FreeBSD the precision of the precise
clock for all the non-precise clocks (the latter have a virtual clock
update period of tc_tick/hz seconds, so they should advertise that as
their precision).  CLOCK_THREAD_CPUTIME_ID uses the cputicker as its
basic clock, but rounds to usec, so its precision is an integral
multiple of 1000 nsec.  The scaling for this is quite different and
broken than that for the old clock ids.  It rounds down instead of
up, and then rounds up to 1000 nsec iff the previous value was 0.
The first scaling step has the nonsense factor of 1000000 instead of
1000000000, and the result is accidentally correct in the usual case
where the cputicker frequency is < 1000000, else garbage (not a multiple
of 1000, and usually too small):

% int
% kern_clock_getres(struct thread *td, clockid_t clock_id, struct timespec *ts)
% {
% 
% 	ts->tv_sec = 0;
% 	switch (clock_id) {
% 	case CLOCK_REALTIME:
% 	case CLOCK_REALTIME_FAST:
% 	case CLOCK_REALTIME_PRECISE:
% 	case CLOCK_MONOTONIC:
% 	case CLOCK_MONOTONIC_FAST:
% 	case CLOCK_MONOTONIC_PRECISE:
% 	case CLOCK_UPTIME:
% 	case CLOCK_UPTIME_FAST:
% 	case CLOCK_UPTIME_PRECISE:

Most of these shouldn't be here (or anywhere).

% 		/*
% 		 * Round up the result of the division cheaply by adding 1.
% 		 * Rounding up is especially important if rounding down
% 		 * would give 0.  Perfect rounding is unimportant.
% 		 */
% 		ts->tv_nsec = 1000000000 / tc_getfrequency() + 1;

Correct value for CLOCK_REALTIME and CLOCK_MONOTONIC.  Also for unportable
aliases of these, but those shouldn't exist.

% 		break;
% 	case CLOCK_VIRTUAL:
% 	case CLOCK_PROF:
% 		/* Accurately round up here because we can do so cheaply. */
% 		ts->tv_nsec = (1000000000 + hz - 1) / hz;

Should use stathz ore maybe just the resolution of a timeval (except when
the precision is limited by the cputicker more than by timevals, use the
former limit).

% 		break;
% 	case CLOCK_SECOND:
% 		ts->tv_sec = 1;
% 		ts->tv_nsec = 0;
% 		break;

Correct.  Unlike most cases, clock_gettime() with this returns an integral
multiple of the precision.  For hardware clock ids with precise hardware,
we don't really know the precision and don't want to round to it (it
will be a small number of nsec, perhaps 0, plus a fraction of a nanosec,
and we don't want to round everything based on these fractions).  But for
software clock ids, the resolution will be about 1-10msec and rounding to
a multiple of 1 or 10 msec would be good if that is the update frequency
(it will often be 1/hz with hz a nice multiple of 10, except for inaccuracies
in the interrupt frequency).

% 	case CLOCK_THREAD_CPUTIME_ID:
% 		/* sync with cputick2usec */
% 		ts->tv_nsec = 1000000 / cpu_tickrate();
% 		if (ts->tv_nsec == 0)
% 			ts->tv_nsec = 1000;
% 		break;

Nonsense scaling.  When cpu_tickrate() <= 1000000, then result is too
small by a factor of 1000 and often invalid since it is not a multiple
of 1000.  But on most or all supported arches, cpu_tickrate() is >
1000000 so the result of the division is 0 and this is fixed up to
the correct value of 1000.

The correct scaling is more like the above:
- use the same scale factor as above.  Here we are scaling to nsec, so
   it is almost irrelvant that cputick2usec() scales to usec.  We should
   just use 1000 from cputick2usec()'s limit on the precision in the
   (usual) case that that limit is stricter (higher) than the one got
   by correct scaling here
- when scaling to nsec, round up as above, not down as this does now
- when scaling to nsec, round up efficiently as above, not with branching
   logic as this does now.  But we probably need the branching logic to
   clamp up to 1000.
Note that clock_gettime(CLOCK_THREAD_CPUTIME_ID, ...) scales to usec
just to use old KPIs which only support timevals, although
clock_gettime() uses timespecs.

% 	default:
% 		return (EINVAL);
% 	}
% 	return (0);
% }

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Fri Jun  1 21:42:41 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 4ADA7106566B;
	Fri,  1 Jun 2012 21:42:41 +0000 (UTC)
	(envelope-from giovanni.trematerra@gmail.com)
Received: from mail-qa0-f47.google.com (mail-qa0-f47.google.com
	[209.85.216.47])
	by mx1.freebsd.org (Postfix) with ESMTP id B57408FC15;
	Fri,  1 Jun 2012 21:42:40 +0000 (UTC)
Received: by qabg1 with SMTP id g1so677010qab.13
	for <multiple recipients>; Fri, 01 Jun 2012 14:42:34 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to
	:cc:content-type;
	bh=1b+jNglZtbxjmC/T2jxHqzQUByHT4QP3SQaqaclj0LA=;
	b=UxZZFTPtRZLOfjplU5Vter32S+ZMyTi1K5bFSf7LvXxeIjwz6RHrv16Oduj/vg8AaE
	dyVvGfAtQ1DMoV49dyldbiqVBaVDFWv2QWzCR26Fu2T6debRcfTy7dvjVoVn2vVl4xZs
	4ae6n2ff3Yo1y2DBNqn7V6iztPTVEUNZ35w9MvYYrXGmT2MRqYl8Ixed3E2u49/c259R
	fuVHJn/q0ySp5Rnuv23nQvaODt0eraUn1WgpVhR94WfT3j/YwWaZSaeHLnAypulpJePP
	ZF3b6gjlmx7enPjSDAtLzxRj6yr2c13G5yacQz8SVHnXf7Nb1UGW6FYK3EBoF/+fD0vb
	DZaQ==
MIME-Version: 1.0
Received: by 10.224.181.134 with SMTP id by6mr5853096qab.56.1338586954256;
	Fri, 01 Jun 2012 14:42:34 -0700 (PDT)
Received: by 10.229.160.20 with HTTP; Fri, 1 Jun 2012 14:42:34 -0700 (PDT)
In-Reply-To: <20120601193522.GA2358@deviant.kiev.zoral.com.ua>
References: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>
	<20120601193522.GA2358@deviant.kiev.zoral.com.ua>
Date: Fri, 1 Jun 2012 23:42:34 +0200
Message-ID: <CACfq092DUHiP42_bW5xSGKPBsiDrQLUPovuxGJfb-9i8e2Gd2A@mail.gmail.com>
From: Giovanni Trematerra <giovanni.trematerra@gmail.com>
To: Konstantin Belousov <kostikbel@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Cc: Attilio Rao <attilio@freebsd.org>, alc@freebsd.org,
	Alexander Kabaev <kan@freebsd.org>, freebsd-arch@freebsd.org
Subject: Re: [RFC] Kernel shared variables
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 01 Jun 2012 21:42:41 -0000

On Fri, Jun 1, 2012 at 9:35 PM, Konstantin Belousov <kostikbel@gmail.com> wrote:
> On Fri, Jun 01, 2012 at 07:53:15PM +0200, Giovanni Trematerra wrote:
>> Hello,
>> I'd like to discuss a way to provide a mechanism to share some read-only
>> data between kernel and user space programs avoiding syscall overhead,
>> implementing some them, such as gettimeofday(3) and time(3) as ordinary
>> user space routine.
>>
>> The patch at
>> http://www.trematerra.net/patches/ksvar_experimental.patch
>>
>
> I very much dislike what you described, it makes ABI maintanence
> a nightmare.

While I respect your decision to dislike my work, could you please give me
some concrete examples of ABI maintenance nightmare? I mean not based
on speculation.

> Below is some mail I wrote around Spring 2009, making some notes about
> desired proposal.

I wonder why your proposal isn't on the Ideas Page wiki.
By the way, is this proposal valid?
http://wiki.freebsd.org/IdeasPage#Avoiding_syscall_overhead_.28GSoC.29

> This is what called vdso in Linux land.

I know what vdso is in Linux land but while implementing vdso will give us some
additional features in any case it needs a mechanism like the one
I implemented (ksvar) to access kernel data while in user space and I think my
implementation isn't too much different from what is called VVAR in
linux parlance.
Please take a look at
http://fxr.watson.org/fxr/source/arch/x86/include/asm/vvar.h?v=linux-2.6
http://fxr.watson.org/fxr/source/arch/x86/vdso/vclock_gettime.c?v=linux-2.6;im=10

Could you please review the kernel part of the patch?

--
Gianni

From owner-freebsd-arch@FreeBSD.ORG  Fri Jun  1 22:23:43 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id BDEB01065670;
	Fri,  1 Jun 2012 22:23:43 +0000 (UTC)
	(envelope-from giovanni.trematerra@gmail.com)
Received: from mail-qa0-f47.google.com (mail-qa0-f47.google.com
	[209.85.216.47])
	by mx1.freebsd.org (Postfix) with ESMTP id 1D5158FC12;
	Fri,  1 Jun 2012 22:23:42 +0000 (UTC)
Received: by qabg1 with SMTP id g1so692044qab.13
	for <multiple recipients>; Fri, 01 Jun 2012 15:23:42 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to
	:cc:content-type:content-transfer-encoding;
	bh=XEteMbH8tH0s92292qHmaUoLnGd0r/RCe99bVApaSaU=;
	b=Lu4IaRzeZdb76tjcTW3tulB2eQtIWCarzwabDYjO62yuvC+0Tw9MPrspwxwTtPKTwk
	ABcM/lO1QRx/BoaVMXx+5okHFgRT6VLVoEVPjut5Q1fTawFbiP4e2xbwPfShwcEp0oJw
	KJXm4PduB/9zvB8IqllSf7qItJoADLZM/f75+EzY3z61U/mTnwoB5uL6migsH1k1AU3Y
	Ev8k5kiboo633CZX3KlfPGT2K8Dxmn3CTD6HStoPaQPxtEbhdiH3oYSaJ2vOstOv74iD
	dofNu2JhW4FsAwCOOsqg+tAPPO3jOnIqc6+eXyoe/ErlvOvo93ZV3tv32leBlXzJFltH
	CNeA==
MIME-Version: 1.0
Received: by 10.224.187.147 with SMTP id cw19mr6012337qab.47.1338589422333;
	Fri, 01 Jun 2012 15:23:42 -0700 (PDT)
Received: by 10.229.160.20 with HTTP; Fri, 1 Jun 2012 15:23:42 -0700 (PDT)
In-Reply-To: <20120602044306.S4049@besplex.bde.org>
References: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>
	<20120602044306.S4049@besplex.bde.org>
Date: Sat, 2 Jun 2012 00:23:42 +0200
Message-ID: <CACfq092qpnack2qhq+7mdNpomFiOVnbWR3XbQm-yZB1A6b6P7w@mail.gmail.com>
From: Giovanni Trematerra <giovanni.trematerra@gmail.com>
To: Bruce Evans <brde@optusnet.com.au>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: Attilio Rao <attilio@freebsd.org>, alc@freebsd.org,
	Konstantin Belousov <kib@freebsd.org>,
	Alexander Kabaev <kan@freebsd.org>, freebsd-arch@freebsd.org
Subject: Re: [RFC] Kernel shared variables
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 01 Jun 2012 22:23:43 -0000

On Fri, Jun 1, 2012 at 11:21 PM, Bruce Evans <brde@optusnet.com.au> wrote:
> On Fri, 1 Jun 2012, Giovanni Trematerra wrote:
>
>> I'd like to discuss a way to provide a mechanism to share some read-only
>> data between kernel and user space programs avoiding syscall overhead,
>> implementing some them, such as gettimeofday(3) and time(3) as ordinary
>> user space routine.
>
>
> This is particularly unsuitable for implementing gettimeofday(), since fo=
r
> it to work you would need to use approximately 1 CPU spinning in the
> kernel to update the time every microsecond. =A0For time(3), it only need=
s
> a relatively slow update. =A0For clock_gettime() with nansoeconds precisi=
on,
> it is even more unsuitable. =A0For clock_gettime() with precisions betwee=
n
> 1 second and 1 microseconds, it is intermediately unsuitable.
>
> It also requires some complications for locking/atomicity and coherency
> (much the same as in the kernel. =A0Not just for times. =A0For times, the
> kernel handles locking/atomicity fairly well, and coherency fairly badly.
>

Well, the primary intend of the patch is to provide a mechanism to share da=
ta
between kernel and user land without switching in kernel mode. Not to provi=
de
a complete re-implementation in user mode of all time stuff.

>
>> The patch at
>> http://www.trematerra.net/patches/ksvar_experimental.patch
>>
>> is in a very experimental stage. It's just a proof-of-concept.
>> Only works for an AMD64 kernel and only for 64-bit applications.
>> The idea is to have all the variables that we want to share between kern=
el
>> and user space into one or more consecutive pages of memory that will be
>> mapped read-only into every running process. At the start of the first
>> shared page
>> there'll be a table with as many entries as the number of the shared
>> variables.
>> Each entry is a 32-bit value that is the offset between the start of the
>> shared
>> page and the start of the variable in the page. The user space processes
>> need
>> to find out the map address of shared page and use the table to access t=
o
>> the
>> shared variables.
>
>
> On amd64, 2 32-bit values or 64-bit values with most bits 0 or 1 can be
> packed/encoded into 1 64-bit value to give a certain atomicity without
> locking. =A0The corresponding i386 packing into 1 32-bit value doesn't wo=
rk
> so well.

These values are written just one time during a SYSINIT routine and are onl=
y
read by user processes.

>
>> ...
>
>
>> Just as proof of concept I re-implemented gettimeofday(3) in user space.
>> First of all I didn't remove the entry into the syscall.master, just
>> renamed the
>> sys_gettimeofday. I need it for the fallback path.
>> In the kernel I introduced a struct wall_clock.
>>
>> +struct wall_clock
>> +{
>> + =A0 =A0 =A0 struct timeval =A0tv;
>> + =A0 =A0 =A0 struct timezone tz;
>> +};
>
>
> This is much larger than 64 bits. =A0struct timezone is relatively
> unimportant.
> struct timeval is bloated on amd64 (128 bits), but can be packed into 64
> bits (works for a few hundred years). =A0On i386, it could be packed into
> 20 bits for tv_usec and 12 bits for an offset for tv_sec.
>

Thanks a lot for your explanation. I think they will be precious as a refer=
ence.
Nonetheless I just wrote gettimeofday in that way just as proof-of-concept,
just to show how things could be supposed to work, it didn't mean to be cor=
rect.
I think it was just unfortunate to have choose gettimeofday.
I'm most interested in the VM things of the patch.

--
Gianni

From owner-freebsd-arch@FreeBSD.ORG  Sat Jun  2 00:11:29 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id BE66B106564A;
	Sat,  2 Jun 2012 00:11:29 +0000 (UTC)
	(envelope-from julian@freebsd.org)
Received: from vps1.elischer.org (vps1.elischer.org [204.109.63.16])
	by mx1.freebsd.org (Postfix) with ESMTP id 4D5FF8FC24;
	Sat,  2 Jun 2012 00:11:26 +0000 (UTC)
Received: from JRE-MBP-2.local (c-67-180-24-15.hsd1.ca.comcast.net
	[67.180.24.15]) (authenticated bits=0)
	by vps1.elischer.org (8.14.5/8.14.5) with ESMTP id q520B1QL038993
	(version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO);
	Fri, 1 Jun 2012 17:11:03 -0700 (PDT)
	(envelope-from julian@freebsd.org)
Message-ID: <4FC95A10.7000806@freebsd.org>
Date: Fri, 01 Jun 2012 17:10:56 -0700
From: Julian Elischer <julian@freebsd.org>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6;
	rv:12.0) Gecko/20120428 Thunderbird/12.0.1
MIME-Version: 1.0
To: Bryan Drewery <bryan@shatow.net>
References: <4FAC3EAB.6050303@delphij.net> <861umkurt8.fsf@ds4.des.no>
	<CAJ-VmokY+pgcq999NHShbq-3rK3=oeWT2WY7NmTvVdXOHZJhdg@mail.gmail.com>
	<CAF6rxgmDW21aPJ5Mp6Tbk1z02ivw4UPhSaNEX+Wiu7O0v13skA@mail.gmail.com>
	<20120517055425.GA802@infradead.org>
	<4FC762DD.90101@FreeBSD.org> <4FC81D9C.2080801@FreeBSD.org>
	<4FC8E29F.2010806@shatow.net>
In-Reply-To: <4FC8E29F.2010806@shatow.net>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: =?ISO-8859-1?Q?Dag-Erling_Sm=F8?=@freebsd.org,
	Adrian Chadd <adrian@freebsd.org>,
	Doug Barton <dougb@freebsd.org>, d@delphij.net,
	Andriy Gapon <avg@freebsd.org>,
	Eitan Adler <lists@eitanadler.com>, freebsd-arch@freebsd.org,
	rgrav <des@des.no>
Subject: Re: Allow small amount of memory be mlock()'ed by unprivileged
	process?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 Jun 2012 00:11:29 -0000

On 6/1/12 8:41 AM, Bryan Drewery wrote:
> On 5/31/2012 8:40 PM, Doug Barton wrote:
>> On 5/31/2012 5:23 AM, Andriy Gapon wrote:
>>> In fact, FreeBSD also has this rlimit and there seems to be full support for it on
>>> both user and kernel sides.
>>> OTOH, PRIV_VM_MLOCK privilege seems to be granted only to the super-user in the
>>> default configuration.  And this privilege kind of defeats the limit.
>>>
>>> Perhaps, we should/could kill the privilege and set the limit to a sufficiently
>>> small/safe value for ordinary users?
>> I like this idea, but someone else in the thread (sorry, don't have it
>> handy) brought up the point that we don't want the aggregate of per-user
>> limits to be able to bring down the system either. So the right solution
>> would seem to be a reasonable per-user limit, and a cap on the maximum
>> total amount of locked pages for all unprivileged users, probably based
>> on some percentage of total available memory?
>>
>> Doug
>>
> I like this approach. A per-user ulimit, and a global max sysctl that
> can be overridden, but by default based on a percentage of available memory.

I'd go a different route.
I'd have it inherited, and I'd have the value be 0 by default, but 
settable to
some different value at login.conf, or by an ancestor with root privs.


From owner-freebsd-arch@FreeBSD.ORG  Sat Jun  2 11:03:51 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 27D56106576A;
	Sat,  2 Jun 2012 11:03:51 +0000 (UTC) (envelope-from avg@FreeBSD.org)
Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140])
	by mx1.freebsd.org (Postfix) with ESMTP id 0A3038FC16;
	Sat,  2 Jun 2012 11:03:46 +0000 (UTC)
Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua
	[212.40.38.100])
	by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id OAA23522;
	Sat, 02 Jun 2012 14:03:44 +0300 (EEST)
	(envelope-from avg@FreeBSD.org)
Received: from localhost ([127.0.0.1])
	by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD))
	id 1Sam7U-000ILr-Bu; Sat, 02 Jun 2012 14:03:44 +0300
Message-ID: <4FC9F30E.4030205@FreeBSD.org>
Date: Sat, 02 Jun 2012 14:03:42 +0300
From: Andriy Gapon <avg@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
	rv:12.0) Gecko/20120503 Thunderbird/12.0.1
MIME-Version: 1.0
To: Doug Barton <dougb@FreeBSD.org>
References: <4FAC3EAB.6050303@delphij.net> <861umkurt8.fsf@ds4.des.no>
	<CAJ-VmokY+pgcq999NHShbq-3rK3=oeWT2WY7NmTvVdXOHZJhdg@mail.gmail.com>
	<CAF6rxgmDW21aPJ5Mp6Tbk1z02ivw4UPhSaNEX+Wiu7O0v13skA@mail.gmail.com>
	<20120517055425.GA802@infradead.org>
	<4FC762DD.90101@FreeBSD.org> <4FC81D9C.2080801@FreeBSD.org>
In-Reply-To: <4FC81D9C.2080801@FreeBSD.org>
X-Enigmail-Version: 1.5pre
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: =?ISO-8859-1?Q?Dag-Erling_Sm=F8?=@FreeBSD.org,
	Adrian Chadd <adrian@FreeBSD.org>, d@delphij.net,
	Eitan Adler <lists@eitanadler.com>, freebsd-arch@FreeBSD.org,
	rgrav <des@des.no>
Subject: Re: Allow small amount of memory be mlock()'ed by unprivileged
	process?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 Jun 2012 11:03:51 -0000

on 01/06/2012 04:40 Doug Barton said the following:
> On 5/31/2012 5:23 AM, Andriy Gapon wrote:
>> In fact, FreeBSD also has this rlimit and there seems to be full support for it on
>> both user and kernel sides.
>> OTOH, PRIV_VM_MLOCK privilege seems to be granted only to the super-user in the
>> default configuration.  And this privilege kind of defeats the limit.
>>
>> Perhaps, we should/could kill the privilege and set the limit to a sufficiently
>> small/safe value for ordinary users?
> 
> I like this idea, but someone else in the thread (sorry, don't have it
> handy) brought up the point that we don't want the aggregate of per-user
> limits to be able to bring down the system either.


The unprivileged users can not spawn any new users on their own and there is a
limit on number of processes per user, so a system administrator should be able
to plan resource limits based on system capacity and utilization.

> So the right solution
> would seem to be a reasonable per-user limit, and a cap on the maximum
> total amount of locked pages for all unprivileged users, probably based
> on some percentage of total available memory?

I would agree for a default limit of zero even.  As long as I (as a system
administrator) am able to increase it for selected users and groups.

-- 
Andriy Gapon

From owner-freebsd-arch@FreeBSD.ORG  Sat Jun  2 11:30:24 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 74216106566C;
	Sat,  2 Jun 2012 11:30:24 +0000 (UTC) (envelope-from avg@FreeBSD.org)
Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140])
	by mx1.freebsd.org (Postfix) with ESMTP id 9F94B8FC0C;
	Sat,  2 Jun 2012 11:30:22 +0000 (UTC)
Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua
	[212.40.38.100])
	by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id OAA23639;
	Sat, 02 Jun 2012 14:30:20 +0300 (EEST)
	(envelope-from avg@FreeBSD.org)
Received: from localhost ([127.0.0.1])
	by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD))
	id 1SamXE-000INm-Ht; Sat, 02 Jun 2012 14:30:20 +0300
Message-ID: <4FC9F94B.8060708@FreeBSD.org>
Date: Sat, 02 Jun 2012 14:30:19 +0300
From: Andriy Gapon <avg@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
	rv:12.0) Gecko/20120503 Thunderbird/12.0.1
MIME-Version: 1.0
To: Julian Elischer <julian@FreeBSD.org>
References: <4FAC3EAB.6050303@delphij.net> <861umkurt8.fsf@ds4.des.no>
	<CAJ-VmokY+pgcq999NHShbq-3rK3=oeWT2WY7NmTvVdXOHZJhdg@mail.gmail.com>
	<CAF6rxgmDW21aPJ5Mp6Tbk1z02ivw4UPhSaNEX+Wiu7O0v13skA@mail.gmail.com>
	<20120517055425.GA802@infradead.org>
	<4FC762DD.90101@FreeBSD.org> <4FC81D9C.2080801@FreeBSD.org>
	<4FC8E29F.2010806@shatow.net> <4FC95A10.7000806@freebsd.org>
In-Reply-To: <4FC95A10.7000806@freebsd.org>
X-Enigmail-Version: 1.5pre
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: Adrian Chadd <adrian@FreeBSD.org>, Doug Barton <dougb@FreeBSD.org>,
	d@delphij.net, Eitan Adler <lists@eitanadler.com>,
	freebsd-arch@FreeBSD.org,
	=?ISO-8859-1?Q?Dag-Erling_Sm=F8rgrav?= <des@des.no>,
	Edward Tomasz Napierala <trasz@FreeBSD.org>,
	Bryan Drewery <bryan@shatow.net>
Subject: Re: Allow small amount of memory be mlock()'ed by unprivileged
	process?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 Jun 2012 11:30:24 -0000

on 02/06/2012 03:10 Julian Elischer said the following:
> I'd go a different route.
> I'd have it inherited, and I'd have the value be 0 by default, but settable to
> some different value at login.conf, or by an ancestor with root privs.

I think that this is how the limits work in general :-)
I agree that defaulting the limit to 0 for non-privileged users is a good idea,
at least at the beginning.  For super-user we might want to keep the limit uncapped.

Some further technical observations:
o  I was overly optimistic about _full_ support for RLIMIT_MEMLOCK - mlockall()
doesn't support itat the moment and I am not sure if it is easy to implement the
support for the MCL_FUTURE case.

o  Currently the default class in default login.conf has memorylocked=unlimited
- not very smart.

o  There is also vm.max_wired sysctl (with no equivalent tunable), which
specifies number of _pages_ that can be wired system wide (by both kernel and
userland).  But note that the limit applies only to userland requests, the
kernel is allowed to wire new pages even when the limit is exceeded.  By default
the limit is set to 1/3 of available pages.
So watch out for this limit when using ZFS, ZFS can easily starve userland.

o  I've just discovered :-) that we also have RCTL/RACCT framework (not enabled
by default) aka "Resource Accounting" / "Resource Limits", which seems to
parallel the conventional limits in many categories including the locked memory.
 Not sure why we have that and if the interactions between conventional limits,
resource limits and privileges would be easy to untangle.

o  A general observation that our way of setting resource limits via login
classes (login.conf) seems to be inferior to limits.conf way of Linux.

More about the last point.
In addition to the traditional users and groups we also have login classes.
Initial (conventional) resource limits can be set only via login.conf, i.e. via
the classes.  The classes can be assigned only in master.passwd and thus only to
users.  So if I want to increase some limit for a group, then I have to create
(and maintain) a parallel class and assign that class to all users in question.
Now imagine a case of a user being in several groups.
Ability to specify the limits on per-user/per-group basis like it is done with
Linux limits.conf seems to be more convenient.
The new rctl framework also allows to set resource limits for "process, user,
login class, or jail".  'group' is missing from the list.

-- 
Andriy Gapon

From owner-freebsd-arch@FreeBSD.ORG  Sat Jun  2 13:01:43 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C0CD7106566B;
	Sat,  2 Jun 2012 13:01:43 +0000 (UTC)
	(envelope-from asmrookie@gmail.com)
Received: from mail-lpp01m010-f54.google.com (mail-lpp01m010-f54.google.com
	[209.85.215.54])
	by mx1.freebsd.org (Postfix) with ESMTP id B0E898FC08;
	Sat,  2 Jun 2012 13:01:42 +0000 (UTC)
Received: by laai10 with SMTP id i10so2736144laa.13
	for <multiple recipients>; Sat, 02 Jun 2012 06:01:36 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:cc:content-type
	:content-transfer-encoding;
	bh=qklQ6e84wildO7Fn6SCjujHT7KqOIGiFlwqTcBYNTK8=;
	b=XEBf4tPOr6vMdKjutq8EhV/LdcSvdG1SZhcACmtCNj0XBZFZmFdbmtcJPg83b/jfuZ
	4fBv1VYvZcbXdiaUCS0EYtLQ9m2brtz9mvQrJjwIvgFffVE20E/iVDDREQ00Y2XsWBYc
	PukzFEZ2F0/DPGm9Ok5FVi0ht6EiPZifWBVAgxB6JWKSo03xQ75+SdU4/dr71Fe8Joxn
	9EL5MWnrjXIAJGyYO3pHw2FaFWzVG6S+hwDg9MsHEqk8iUzNTXx3aEbjqmjVo1x+u7CQ
	kbtHAVEMTnJCXfQMF4VeuOjCj4BonRIGJt4iUTmjOe+NtyyQ2bt0J8iWtIttX+1T81Ya
	nsGA==
MIME-Version: 1.0
Received: by 10.152.135.105 with SMTP id pr9mr6535166lab.37.1338642095871;
	Sat, 02 Jun 2012 06:01:35 -0700 (PDT)
Sender: asmrookie@gmail.com
Received: by 10.112.27.65 with HTTP; Sat, 2 Jun 2012 06:01:35 -0700 (PDT)
In-Reply-To: <20120601193522.GA2358@deviant.kiev.zoral.com.ua>
References: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>
	<20120601193522.GA2358@deviant.kiev.zoral.com.ua>
Date: Sat, 2 Jun 2012 14:01:35 +0100
X-Google-Sender-Auth: RlMy4owDf64wdOP4xglAbXaJ6ao
Message-ID: <CAJ-FndC71=3Jo+BxQi==gCoLipBxj8X8XMBydjvrcKeGw+WOnA@mail.gmail.com>
From: Attilio Rao <attilio@freebsd.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Cc: alc@freebsd.org, Alexander Kabaev <kan@freebsd.org>,
	Giovanni Trematerra <giovanni.trematerra@gmail.com>,
	freebsd-arch@freebsd.org
Subject: Re: [RFC] Kernel shared variables
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 Jun 2012 13:01:43 -0000

2012/6/1 Konstantin Belousov <kostikbel@gmail.com>:
> On Fri, Jun 01, 2012 at 07:53:15PM +0200, Giovanni Trematerra wrote:
>> Hello,
>> I'd like to discuss a way to provide a mechanism to share some read-only
>> data between kernel and user space programs avoiding syscall overhead,
>> implementing some them, such as gettimeofday(3) and time(3) as ordinary
>> user space routine.
>>
>> The patch at
>> http://www.trematerra.net/patches/ksvar_experimental.patch
>>
>> is in a very experimental stage. It's just a proof-of-concept.
>> Only works for an AMD64 kernel and only for 64-bit applications.
>> The idea is to have all the variables that we want to share between kern=
el
>> and user space into one or more consecutive pages of memory that will be
>> mapped read-only into every running process. At the start of the first
>> shared page
>> there'll be a table with as many entries as the number of the shared var=
iables.
>> Each entry is a 32-bit value that is the offset between the start of the=
 shared
>> page and the start of the variable in the page. The user space processes=
 need
>> to find out the map address of shared page and use the table to access t=
o the
>> shared variables.
>> Kernel will export a variable to user space as an index, so user space c=
ode
>> must refer to a specific index to access a kernel shared variable.
>> Let's take a quick look to the KPI/API for exporting/importing kernel
>> shared variables.
>> Say we want implement a routine to export an int from the kernel.
>> To define the variable to be exported inside the kernel you would use
>>
>> KSVAR_DEFINE(0, int, test_value);
>>
>> You have just defined an int variable named "test_value" at index 0.
>> Inside the kernel you can write/read as usual using the symbol test_valu=
e;
>> Now you likely want add to libc a function callable from user processes
>> that return the test_value variable. So first of all you need the import=
 the
>> variable.
>>
>> KSVAR_IMPORT(0, int, test_value);
>>
>> and to obtain a pointer to read the value you would use
>>
>> KSVAR(test_value);
>>
>> so your function would look like something like this
>>
>> int get_test_value()
>> {
>>
>> =C2=A0 =C2=A0 =C2=A0return (*KSVAR(test_value));
>> }
>>
>> Then inside your process just call get_test_value() function as you usua=
lly
>> do and you'll get a kernel written value without switching in kernel mod=
e.
>>
>> Let's see now in more detail how that could be accomplished.
>> The shared variables will be accessed as normal variables and are read/w=
rite
>> inside the kernel. The variables need to be inside the same page(s) and =
nothing
>> but the shared variables (and the table) must be into the page(s). To
>> obtain that
>> I changed the linker script in this way
>>
>> --- a/sys/conf/ldscript.amd64
>> +++ b/sys/conf/ldscript.amd64
>> @@ -177,6 +177,15 @@ SECTIONS
>> =C2=A0 =C2=A0 *(.ldata .ldata.* .gnu.linkonce.l.*)
>> =C2=A0 =C2=A0 . =3D ALIGN(. !=3D 0 ? 64 / 8 : 1);
>> =C2=A0 }
>> + =C2=A0.ksvar ALIGN(CONSTANT (COMMONPAGESIZE)) :
>> + =C2=A0{
>> + =C2=A0 =C2=A0__ksvar_set_start =3D .;
>> + =C2=A0 =C2=A0*(.ksvar_table)
>> + =C2=A0 =C2=A0*(.ksvar)
>> +
>> + =C2=A0 . =3D ALIGN(CONSTANT (COMMONPAGESIZE));
>> + =C2=A0 __ksvar_set_stop =3D .;
>> + =C2=A0}
>> =C2=A0 . =3D ALIGN(64 / 8);
>> =C2=A0 _end =3D .; PROVIDE (end =3D .);
>> =C2=A0 . =3D DATA_SEGMENT_END (.);
>>
>> When we want to define a variable in the kernel to share with user space
>> we have to use KSVAR_DEFINE macro in sys/sys/ksvar.h
>>
>> +struct ksvar_set {
>> + =C2=A0 =C2=A0 =C2=A0 uint32_t idx;
>> + =C2=A0 =C2=A0 =C2=A0 char *pksvar;
>> +};
>> +
>> +/*
>> + * Declare a variable into kernel shared linker_set.
>> + */
>> +#define =C2=A0 =C2=A0 =C2=A0 =C2=A0KSVAR_DEFINE(index, type, name) \
>> + =C2=A0 =C2=A0 =C2=A0 static type name __section(".ksvar"); =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 \
>> + =C2=A0 =C2=A0 =C2=A0 static struct ksvar_set name ## _ksvar_set =3D { =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0\
>> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 .idx =3D index, =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 \
>> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 .pksvar =3D (char *) =
&name =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0\
>> + =C2=A0 =C2=A0 =C2=A0 }; =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0\
>> + =C2=A0 =C2=A0 =C2=A0 DATA_SET(ksvar_set, name ## _ksvar_set)
>>
>> Every variable must have a unique index. The indexes must
>> start from zero and be consecutive. When you add an index
>> you must bump the size of the table (KSVAR_TABLE_SIZE)
>> (see sys/sys/ksvar.h)
>>
>> The variables are inside the kernel static image that isn't managed
>> by the VM and so we need to allocate pages to map the physical addresses=
.
>> A new SYSINIT (ksvarinit) will allocate a set of vm_page_t =C2=A0through
>> the vm_phys_fictitious_reg_range interface and fill the table using
>> the information
>> of the ksvar_set linker set, then will create a vm_object_t (vm_object_k=
svar),
>> mark the fake pages as valid and put them into it.
>> When a new process is created by exec(3) the vm_object_ksvar will be
>> mapped read-only into the process address space by vm_map_fixed routine
>> just before mapping the user stack. The address of mapping will be recor=
ded
>> inside the new p_ksvar field of the struct proc.
>> This field will be exported through a sysctl to the user space processes=
.
>> In order to implement syscalls as user space routines, we have to find o=
ut the
>> mapped address of the kernel shared variables when the libc is mapped in=
to
>> the process. So I added a function marked with the attribute constructor=
.
>> It will called before any code into user process and before any code ins=
ide
>> the libc.
>>
>> +__attribute((constructor)) void init_kernel_shared()
>> +{
>> + =C2=A0 =C2=A0 =C2=A0 int mib[2];
>> + =C2=A0 =C2=A0 =C2=A0 size_t len;
>> + =C2=A0 =C2=A0 =C2=A0 vm_offset_t ksvar_address;
>> +
>> + =C2=A0 =C2=A0 =C2=A0 mib[0] =3D CTL_KERN;
>> + =C2=A0 =C2=A0 =C2=A0 mib[1] =3D KERN_KSVAR;
>> + =C2=A0 =C2=A0 =C2=A0 len =3D sizeof(vm_offset_t);
>> + =C2=A0 =C2=A0 =C2=A0 if (__sysctl(mib, 2, (void *) &ksvar_address, &le=
n, NULL, 0) !=3D -1)
>> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 ksvar_table =3D (uint=
32_t *) ksvar_address;
>> +}
>>
>> Once the libc knows the address of the table it can access to the shared
>> variables.
>>
>> Just as proof of concept I re-implemented gettimeofday(3) in user space.
>> First of all I didn't remove the entry into the syscall.master, just ren=
amed the
>> sys_gettimeofday. I need it for the fallback path.
>> In the kernel I introduced a struct wall_clock.
>>
>> +struct wall_clock
>> +{
>> + =C2=A0 =C2=A0 =C2=A0 struct timeval =C2=A0tv;
>> + =C2=A0 =C2=A0 =C2=A0 struct timezone tz;
>> +};
>>
>> The struct is exported through sys/sys/time.h header.
>> I defined a new kernel shared variable. To do so I added an index in
>> sys/sys/ksvar.h
>> WALL_CLOCK_INDEX and bumped KSVAR_TABLE_SIZE to 1.
>> In the sys/kern/kern_clocksource.c
>>
>> +/* kernel shared variable for implmenting gettimeofday. */
>> +KSVAR_DEFINE(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);
>>
>> Now we defined a shared variable at index WALL_CLOCK_INDEX of type
>> struct wall_clock and named wall_clock.
>> Inside handleevents I update the info exported by wall_clock.
>>
>> + =C2=A0 =C2=A0 =C2=A0 struct timeval tv;
>> +
>> + =C2=A0 =C2=A0 =C2=A0 /* update time for userspace gettimeofday */
>> + =C2=A0 =C2=A0 =C2=A0 microtime(&tv);
>> + =C2=A0 =C2=A0 =C2=A0 wall_clock.tv =3D tv;
>> + =C2=A0 =C2=A0 =C2=A0 wall_clock.tz.tz_minuteswest =3D tz_minuteswest;
>> + =C2=A0 =C2=A0 =C2=A0 wall_clock.tz.tz_dsttime =3D tz_dsttime;
>>
>> Now, in libc we import the shared variable
>>
>> +KSVAR_IMPORT(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);
>>
>> note that WALL_CLOCK_INDEX must be the same of the one defined
>> inside the kernel, and define a new function gettimeofday
>>
>> +int
>> +gettimeofday(struct timeval *tp, struct timezone *tzp)
>> +{
>> +
>> + =C2=A0 =C2=A0 =C2=A0 /* fallback to syscall if kernel doesn't export k=
svar */
>> + =C2=A0 =C2=A0 =C2=A0 if (!KSVAR_IS_ACTIVE())
>> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 return (sys_gettimeof=
day(tp, tzp));
>> +
>> + =C2=A0 =C2=A0 =C2=A0 if (tp !=3D NULL)
>> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 *tp =3D KSVAR(wall_cl=
ock)->tv;
>> + =C2=A0 =C2=A0 =C2=A0 if (tzp !=3D NULL)
>> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 *tzp =3D KSVAR(wall_c=
lock)->tz;
>> + =C2=A0 =C2=A0 =C2=A0 return (0);
>> +}
>>
>> Now when a process will call getimeofday, will call that function actual=
ly.
>> If the process makes a lot of call to gettimeofday, we will see a
>> performance boost.
>> Note that if ksvar are not exported from the kernel (KSVAR_IS_ACTIVE),
>> the function
>> fallback to call the actual syscall (sys_gettimeofday).
>>
>> Open tasks
>> - implement support for 32-bit emulated processes running in a 64-bit
>> environment.
>> - extend support to others arch
>> - implement more syscalls
>> - benchmarks
>> - Test, test, test.
>>
>> I'm looking forward to hear about your comments and suggestions.
>
> I very much dislike what you described, it makes ABI maintanence
> a nightmare.
> Below is some mail I wrote around Spring 2009, making some notes about
> desired proposal. This is what called vdso in Linux land.

Did you bother to read at least Giovanni's description?
Because this has nothing to do with VDSO in Linux.

I think, he just wants to map in userland processes some pages from
the static image of the kernel (packed together in a specific
dataset). This imposes some non-trivial problem. The first thing is
that the static image is not thought to have physical pages tied to
it. The second is that he needs to make a clean design in order to let
consumer of this mechanism to correctly locate informations they want
within the shared page(s) and in the end read the correct values.

I have some reservations on both the implementation and the approach
for retrieving datas from the page.
In particular, I don't like that a new vm_object is allocated for this
page. What I really would like would be:
1) very minimal implementation -- you just use
pmap_enter()/pmap_remove() specifically when needed, separately, in
fork(), execve(), etc. cases
2) more complete approach -- you make a very quick layer which let you
map pages from the static image of the kernel and the shared page
becomes just a specific consumer of this. This way the object has much
more sense because it becomes an object associated to all the static
image of the kernel

About the layering, I don't like that you require both a kernel and
userland header to locate the objects within the page. This is very
likely ABI breakage prone. It is needed a mechanism for retrieving at
run time what Giovanni calls "indexes", or making it indexes-agnostic.

Attilio


--=20
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-arch@FreeBSD.ORG  Sat Jun  2 16:49:09 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E34B01065676;
	Sat,  2 Jun 2012 16:49:09 +0000 (UTC)
	(envelope-from kostikbel@gmail.com)
Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200])
	by mx1.freebsd.org (Postfix) with ESMTP id 1F3BE8FC0A;
	Sat,  2 Jun 2012 16:49:07 +0000 (UTC)
Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1])
	by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q52GmlRo059366;
	Sat, 2 Jun 2012 19:48:47 +0300 (EEST)
	(envelope-from kostikbel@gmail.com)
Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1])
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id
	q52GmlPU074939; Sat, 2 Jun 2012 19:48:47 +0300 (EEST)
	(envelope-from kostikbel@gmail.com)
Received: (from kostik@localhost)
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q52GmlkT074938; 
	Sat, 2 Jun 2012 19:48:47 +0300 (EEST)
	(envelope-from kostikbel@gmail.com)
X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to
	kostikbel@gmail.com using -f
Date: Sat, 2 Jun 2012 19:48:47 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Attilio Rao <attilio@freebsd.org>
Message-ID: <20120602164847.GB2358@deviant.kiev.zoral.com.ua>
References: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>
	<20120601193522.GA2358@deviant.kiev.zoral.com.ua>
	<CAJ-FndC71=3Jo+BxQi==gCoLipBxj8X8XMBydjvrcKeGw+WOnA@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="gQmjQz8lQ7hwrZL9"
Content-Disposition: inline
In-Reply-To: <CAJ-FndC71=3Jo+BxQi==gCoLipBxj8X8XMBydjvrcKeGw+WOnA@mail.gmail.com>
User-Agent: Mutt/1.4.2.3i
X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00
	autolearn=ham version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on
	skuns.kiev.zoral.com.ua
Cc: alc@freebsd.org, Alexander Kabaev <kan@freebsd.org>,
	Giovanni Trematerra <giovanni.trematerra@gmail.com>,
	freebsd-arch@freebsd.org
Subject: Re: [RFC] Kernel shared variables
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 Jun 2012 16:49:10 -0000


--gQmjQz8lQ7hwrZL9
Content-Type: text/plain; charset=koi8-r
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sat, Jun 02, 2012 at 02:01:35PM +0100, Attilio Rao wrote:
> 2012/6/1 Konstantin Belousov <kostikbel@gmail.com>:
> > On Fri, Jun 01, 2012 at 07:53:15PM +0200, Giovanni Trematerra wrote:
> >> Hello,
> >> I'd like to discuss a way to provide a mechanism to share some read-on=
ly
> >> data between kernel and user space programs avoiding syscall overhead,
> >> implementing some them, such as gettimeofday(3) and time(3) as ordinary
> >> user space routine.
> >>
> >> The patch at
> >> http://www.trematerra.net/patches/ksvar_experimental.patch
> >>
> >> is in a very experimental stage. It's just a proof-of-concept.
> >> Only works for an AMD64 kernel and only for 64-bit applications.
> >> The idea is to have all the variables that we want to share between ke=
rnel
> >> and user space into one or more consecutive pages of memory that will =
be
> >> mapped read-only into every running process. At the start of the first
> >> shared page
> >> there'll be a table with as many entries as the number of the shared v=
ariables.
> >> Each entry is a 32-bit value that is the offset between the start of t=
he shared
> >> page and the start of the variable in the page. The user space process=
es need
> >> to find out the map address of shared page and use the table to access=
 to the
> >> shared variables.
> >> Kernel will export a variable to user space as an index, so user space=
 code
> >> must refer to a specific index to access a kernel shared variable.
> >> Let's take a quick look to the KPI/API for exporting/importing kernel
> >> shared variables.
> >> Say we want implement a routine to export an int from the kernel.
> >> To define the variable to be exported inside the kernel you would use
> >>
> >> KSVAR_DEFINE(0, int, test_value);
> >>
> >> You have just defined an int variable named "test_value" at index 0.
> >> Inside the kernel you can write/read as usual using the symbol test_va=
lue;
> >> Now you likely want add to libc a function callable from user processes
> >> that return the test_value variable. So first of all you need the impo=
rt the
> >> variable.
> >>
> >> KSVAR_IMPORT(0, int, test_value);
> >>
> >> and to obtain a pointer to read the value you would use
> >>
> >> KSVAR(test_value);
> >>
> >> so your function would look like something like this
> >>
> >> int get_test_value()
> >> {
> >>
> >> =9A =9A =9Areturn (*KSVAR(test_value));
> >> }
> >>
> >> Then inside your process just call get_test_value() function as you us=
ually
> >> do and you'll get a kernel written value without switching in kernel m=
ode.
> >>
> >> Let's see now in more detail how that could be accomplished.
> >> The shared variables will be accessed as normal variables and are read=
/write
> >> inside the kernel. The variables need to be inside the same page(s) an=
d nothing
> >> but the shared variables (and the table) must be into the page(s). To
> >> obtain that
> >> I changed the linker script in this way
> >>
> >> --- a/sys/conf/ldscript.amd64
> >> +++ b/sys/conf/ldscript.amd64
> >> @@ -177,6 +177,15 @@ SECTIONS
> >> =9A =9A *(.ldata .ldata.* .gnu.linkonce.l.*)
> >> =9A =9A . =3D ALIGN(. !=3D 0 ? 64 / 8 : 1);
> >> =9A }
> >> + =9A.ksvar ALIGN(CONSTANT (COMMONPAGESIZE)) :
> >> + =9A{
> >> + =9A =9A__ksvar_set_start =3D .;
> >> + =9A =9A*(.ksvar_table)
> >> + =9A =9A*(.ksvar)
> >> +
> >> + =9A . =3D ALIGN(CONSTANT (COMMONPAGESIZE));
> >> + =9A __ksvar_set_stop =3D .;
> >> + =9A}
> >> =9A . =3D ALIGN(64 / 8);
> >> =9A _end =3D .; PROVIDE (end =3D .);
> >> =9A . =3D DATA_SEGMENT_END (.);
> >>
> >> When we want to define a variable in the kernel to share with user spa=
ce
> >> we have to use KSVAR_DEFINE macro in sys/sys/ksvar.h
> >>
> >> +struct ksvar_set {
> >> + =9A =9A =9A uint32_t idx;
> >> + =9A =9A =9A char *pksvar;
> >> +};
> >> +
> >> +/*
> >> + * Declare a variable into kernel shared linker_set.
> >> + */
> >> +#define =9A =9A =9A =9AKSVAR_DEFINE(index, type, name) \
> >> + =9A =9A =9A static type name __section(".ksvar"); =9A =9A =9A =9A =
=9A =9A =9A =9A =9A \
> >> + =9A =9A =9A static struct ksvar_set name ## _ksvar_set =3D { =9A =9A=
 =9A =9A =9A\
> >> + =9A =9A =9A =9A =9A =9A =9A .idx =3D index, =9A =9A =9A =9A =9A =9A =
=9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A \
> >> + =9A =9A =9A =9A =9A =9A =9A .pksvar =3D (char *) &name =9A =9A =9A =
=9A =9A =9A =9A =9A =9A =9A =9A =9A\
> >> + =9A =9A =9A }; =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =
=9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A\
> >> + =9A =9A =9A DATA_SET(ksvar_set, name ## _ksvar_set)
> >>
> >> Every variable must have a unique index. The indexes must
> >> start from zero and be consecutive. When you add an index
> >> you must bump the size of the table (KSVAR_TABLE_SIZE)
> >> (see sys/sys/ksvar.h)
> >>
> >> The variables are inside the kernel static image that isn't managed
> >> by the VM and so we need to allocate pages to map the physical address=
es.
> >> A new SYSINIT (ksvarinit) will allocate a set of vm_page_t =9Athrough
> >> the vm_phys_fictitious_reg_range interface and fill the table using
> >> the information
> >> of the ksvar_set linker set, then will create a vm_object_t (vm_object=
_ksvar),
> >> mark the fake pages as valid and put them into it.
> >> When a new process is created by exec(3) the vm_object_ksvar will be
> >> mapped read-only into the process address space by vm_map_fixed routine
> >> just before mapping the user stack. The address of mapping will be rec=
orded
> >> inside the new p_ksvar field of the struct proc.
> >> This field will be exported through a sysctl to the user space process=
es.
> >> In order to implement syscalls as user space routines, we have to find=
 out the
> >> mapped address of the kernel shared variables when the libc is mapped =
into
> >> the process. So I added a function marked with the attribute construct=
or.
> >> It will called before any code into user process and before any code i=
nside
> >> the libc.
> >>
> >> +__attribute((constructor)) void init_kernel_shared()
> >> +{
> >> + =9A =9A =9A int mib[2];
> >> + =9A =9A =9A size_t len;
> >> + =9A =9A =9A vm_offset_t ksvar_address;
> >> +
> >> + =9A =9A =9A mib[0] =3D CTL_KERN;
> >> + =9A =9A =9A mib[1] =3D KERN_KSVAR;
> >> + =9A =9A =9A len =3D sizeof(vm_offset_t);
> >> + =9A =9A =9A if (__sysctl(mib, 2, (void *) &ksvar_address, &len, NULL=
, 0) !=3D -1)
> >> + =9A =9A =9A =9A =9A =9A =9A ksvar_table =3D (uint32_t *) ksvar_addre=
ss;
> >> +}
> >>
> >> Once the libc knows the address of the table it can access to the shar=
ed
> >> variables.
> >>
> >> Just as proof of concept I re-implemented gettimeofday(3) in user spac=
e.
> >> First of all I didn't remove the entry into the syscall.master, just r=
enamed the
> >> sys_gettimeofday. I need it for the fallback path.
> >> In the kernel I introduced a struct wall_clock.
> >>
> >> +struct wall_clock
> >> +{
> >> + =9A =9A =9A struct timeval =9Atv;
> >> + =9A =9A =9A struct timezone tz;
> >> +};
> >>
> >> The struct is exported through sys/sys/time.h header.
> >> I defined a new kernel shared variable. To do so I added an index in
> >> sys/sys/ksvar.h
> >> WALL_CLOCK_INDEX and bumped KSVAR_TABLE_SIZE to 1.
> >> In the sys/kern/kern_clocksource.c
> >>
> >> +/* kernel shared variable for implmenting gettimeofday. */
> >> +KSVAR_DEFINE(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);
> >>
> >> Now we defined a shared variable at index WALL_CLOCK_INDEX of type
> >> struct wall_clock and named wall_clock.
> >> Inside handleevents I update the info exported by wall_clock.
> >>
> >> + =9A =9A =9A struct timeval tv;
> >> +
> >> + =9A =9A =9A /* update time for userspace gettimeofday */
> >> + =9A =9A =9A microtime(&tv);
> >> + =9A =9A =9A wall_clock.tv =3D tv;
> >> + =9A =9A =9A wall_clock.tz.tz_minuteswest =3D tz_minuteswest;
> >> + =9A =9A =9A wall_clock.tz.tz_dsttime =3D tz_dsttime;
> >>
> >> Now, in libc we import the shared variable
> >>
> >> +KSVAR_IMPORT(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);
> >>
> >> note that WALL_CLOCK_INDEX must be the same of the one defined
> >> inside the kernel, and define a new function gettimeofday
> >>
> >> +int
> >> +gettimeofday(struct timeval *tp, struct timezone *tzp)
> >> +{
> >> +
> >> + =9A =9A =9A /* fallback to syscall if kernel doesn't export ksvar */
> >> + =9A =9A =9A if (!KSVAR_IS_ACTIVE())
> >> + =9A =9A =9A =9A =9A =9A =9A return (sys_gettimeofday(tp, tzp));
> >> +
> >> + =9A =9A =9A if (tp !=3D NULL)
> >> + =9A =9A =9A =9A =9A =9A =9A *tp =3D KSVAR(wall_clock)->tv;
> >> + =9A =9A =9A if (tzp !=3D NULL)
> >> + =9A =9A =9A =9A =9A =9A =9A *tzp =3D KSVAR(wall_clock)->tz;
> >> + =9A =9A =9A return (0);
> >> +}
> >>
> >> Now when a process will call getimeofday, will call that function actu=
ally.
> >> If the process makes a lot of call to gettimeofday, we will see a
> >> performance boost.
> >> Note that if ksvar are not exported from the kernel (KSVAR_IS_ACTIVE),
> >> the function
> >> fallback to call the actual syscall (sys_gettimeofday).
> >>
> >> Open tasks
> >> - implement support for 32-bit emulated processes running in a 64-bit
> >> environment.
> >> - extend support to others arch
> >> - implement more syscalls
> >> - benchmarks
> >> - Test, test, test.
> >>
> >> I'm looking forward to hear about your comments and suggestions.
> >
> > I very much dislike what you described, it makes ABI maintanence
> > a nightmare.
> > Below is some mail I wrote around Spring 2009, making some notes about
> > desired proposal. This is what called vdso in Linux land.
>=20
> Did you bother to read at least Giovanni's description?
> Because this has nothing to do with VDSO in Linux.
Did you bothered to think shortly why do I object ?

>=20
> I think, he just wants to map in userland processes some pages from
> the static image of the kernel (packed together in a specific
> dataset). This imposes some non-trivial problem. The first thing is
> that the static image is not thought to have physical pages tied to
> it. The second is that he needs to make a clean design in order to let
> consumer of this mechanism to correctly locate informations they want
> within the shared page(s) and in the end read the correct values.
Right, exactly, and this is why I object to the "offsets" approach.
It basically moves us to the old times of the "jump tables" shared
libraries, that fortunately was never a case for FreeBSD even when
a.out was used.

>=20
> I have some reservations on both the implementation and the approach
> for retrieving datas from the page.
> In particular, I don't like that a new vm_object is allocated for this
> page. What I really would like would be:
> 1) very minimal implementation -- you just use
> pmap_enter()/pmap_remove() specifically when needed, separately, in
> fork(), execve(), etc. cases
Oh, this simply cannot work.

> 2) more complete approach -- you make a very quick layer which let you
> map pages from the static image of the kernel and the shared page
> becomes just a specific consumer of this. This way the object has much
> more sense because it becomes an object associated to all the static
> image of the kernel
So you want to circumvent the vm layer.

>=20
> About the layering, I don't like that you require both a kernel and
> userland header to locate the objects within the page. This is very
> likely ABI breakage prone. It is needed a mechanism for retrieving at
> run time what Giovanni calls "indexes", or making it indexes-agnostic.

And this is what VDSO is for. VDSO with the standard ELF symbol
interposition rules allow to have libc that is completely unaware of the
shared page and 'indexes', i.e. which works both for older kernel that
do not export required index, and for new kernels that export the same
information in some more advanced format. By having VDSO that exports
e.g. gettimeofday() we would get override for libc gettimeofday, while
having fully functional libc for other, future and past, kernels, even
if the format of the data exported for super-fast gettimeofday changes.

The tight between VDSO and kernel is not a problem, since VDSO is part
of the kernel from the deployment POV. More. either existing ELF
linker in kernel, or some trivial modifications to it, would allow
to not use 'indexes' on the kernel side too.

We already have a shared page between kernel and whole set of the same-ABI
processes. Currently it is used for signal trampolines only.
The hard parts of the task is to provide VDSO build glue. Also IMO the
hard task is to define sensible gettimeofday() implementation, probably
using rdtsc in usermode. Shared page is easy, or at least it is already
there without ugly and non-working vm hacks.

As an additional note, already put by Bruce, the implementation of
usermode gettimeofday is exactly opposite of any reasonable implementation.
It looses the precision to the frequency of the event timer. Obvious
approach is to not have any periodically updating data for gettimeofday
purpose, and use some formula with rdtsc and kernel-provided coefficients
on the machines where rdtsc is usable.

Interesting question is how much shared the shared page needs be.
Obvious needs are shared between all same-ABI processes, but I can also
easily see a need for the per-process private information be present in
the 'private-shared' page. For silly but typical example, useful for
moronix-style benchmarks, see getpid().

Shrug.

--gQmjQz8lQ7hwrZL9
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (FreeBSD)

iEYEARECAAYFAk/KQ+8ACgkQC3+MBN1Mb4g2NACgkLX/iLA3GzLGxP81Orzy+X7G
GVEAoIuyoHDauMOErYp+wNLxNWZp5vBF
=gT67
-----END PGP SIGNATURE-----

--gQmjQz8lQ7hwrZL9--

From owner-freebsd-arch@FreeBSD.ORG  Sat Jun  2 17:00:09 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E67B4106564A;
	Sat,  2 Jun 2012 17:00:08 +0000 (UTC)
	(envelope-from asmrookie@gmail.com)
Received: from mail-lpp01m010-f54.google.com (mail-lpp01m010-f54.google.com
	[209.85.215.54])
	by mx1.freebsd.org (Postfix) with ESMTP id AE3298FC18;
	Sat,  2 Jun 2012 17:00:07 +0000 (UTC)
Received: by laai10 with SMTP id i10so2823846laa.13
	for <multiple recipients>; Sat, 02 Jun 2012 10:00:06 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:content-type
	:content-transfer-encoding;
	bh=kTgNyiKz8flKgtATHJ/9fmR9G2Pd/ZSOjw0m3AUVvQQ=;
	b=zdzkBwlKOmni4ta3rf9Y6A3FqUJBlyBreqeKnCKGsogRVSZ9uamJ07LPh5h9YwylTQ
	fFcy+UG7DghtNLzajGSaoBuPnjU25XcBS+iue8qaKXkSOaZ8f20uDivBI1olm02aNxmf
	J7eKIUiE5czQbqfOloaCMl2PS5TKogg+m9t911LOkNWK8KRUBBTreDx1x7e/Ysp87Oqi
	s4TH0FT357S42AEbvkgg8jP+CJAknCfRSjH6ccDc7JmBx/Sxpl4y0CCttWpl3czLhM+9
	ZbEncDquAacDefbjlvyK5hsafbIpFuGwdRfVKbpbCewtyyeZT5QJkJ7cqgbiT7cwxQy8
	gF7w==
MIME-Version: 1.0
Received: by 10.112.45.4 with SMTP id i4mr3701338lbm.79.1338656406504; Sat, 02
	Jun 2012 10:00:06 -0700 (PDT)
Sender: asmrookie@gmail.com
Received: by 10.112.27.65 with HTTP; Sat, 2 Jun 2012 10:00:06 -0700 (PDT)
In-Reply-To: <CAJ-FndAXFwuEspq+QeF0Hv1dr8JjREP=c=g3-abP=eoZ-D4hEg@mail.gmail.com>
References: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>
	<20120601193522.GA2358@deviant.kiev.zoral.com.ua>
	<CAJ-FndC71=3Jo+BxQi==gCoLipBxj8X8XMBydjvrcKeGw+WOnA@mail.gmail.com>
	<20120602164847.GB2358@deviant.kiev.zoral.com.ua>
	<CAJ-FndAXFwuEspq+QeF0Hv1dr8JjREP=c=g3-abP=eoZ-D4hEg@mail.gmail.com>
Date: Sat, 2 Jun 2012 18:00:06 +0100
X-Google-Sender-Auth: sMCBm15RYm0QSB4Z016r5COjoKo
Message-ID: <CAJ-FndCpztSWyJo2hRVs5qu+vQOj9E1mPBhfVOxM_OC2eNac6A@mail.gmail.com>
From: Attilio Rao <attilio@freebsd.org>
To: freebsd-arch@freebsd.org, Gianni <gianni@freebsd.org>, 
	Alexander Kabaev <kan@freebsd.org>, Alan Cox <alc@rice.edu>,
	Konstantin Belousov <kib@freebsd.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Cc: 
Subject: Fwd: [RFC] Kernel shared variables
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 Jun 2012 17:00:09 -0000

Sorry, resending with all the recipients in.

Attilio


---------- Forwarded message ----------
From: Attilio Rao <attilio@freebsd.org>
Date: 2012/6/2
Subject: Re: [RFC] Kernel shared variables
To: Konstantin Belousov <kostikbel@gmail.com>


2012/6/2 Konstantin Belousov <kostikbel@gmail.com>:
> On Sat, Jun 02, 2012 at 02:01:35PM +0100, Attilio Rao wrote:
>> 2012/6/1 Konstantin Belousov <kostikbel@gmail.com>:
>> > On Fri, Jun 01, 2012 at 07:53:15PM +0200, Giovanni Trematerra wrote:
>> >> Hello,
>> >> I'd like to discuss a way to provide a mechanism to share some read-o=
nly
>> >> data between kernel and user space programs avoiding syscall overhead=
,
>> >> implementing some them, such as gettimeofday(3) and time(3) as ordina=
ry
>> >> user space routine.
>> >>
>> >> The patch at
>> >> http://www.trematerra.net/patches/ksvar_experimental.patch
>> >>
>> >> is in a very experimental stage. It's just a proof-of-concept.
>> >> Only works for an AMD64 kernel and only for 64-bit applications.
>> >> The idea is to have all the variables that we want to share between k=
ernel
>> >> and user space into one or more consecutive pages of memory that will=
 be
>> >> mapped read-only into every running process. At the start of the firs=
t
>> >> shared page
>> >> there'll be a table with as many entries as the number of the shared =
variables.
>> >> Each entry is a 32-bit value that is the offset between the start of =
the shared
>> >> page and the start of the variable in the page. The user space proces=
ses need
>> >> to find out the map address of shared page and use the table to acces=
s to the
>> >> shared variables.
>> >> Kernel will export a variable to user space as an index, so user spac=
e code
>> >> must refer to a specific index to access a kernel shared variable.
>> >> Let's take a quick look to the KPI/API for exporting/importing kernel
>> >> shared variables.
>> >> Say we want implement a routine to export an int from the kernel.
>> >> To define the variable to be exported inside the kernel you would use
>> >>
>> >> KSVAR_DEFINE(0, int, test_value);
>> >>
>> >> You have just defined an int variable named "test_value" at index 0.
>> >> Inside the kernel you can write/read as usual using the symbol test_v=
alue;
>> >> Now you likely want add to libc a function callable from user process=
es
>> >> that return the test_value variable. So first of all you need the imp=
ort the
>> >> variable.
>> >>
>> >> KSVAR_IMPORT(0, int, test_value);
>> >>
>> >> and to obtain a pointer to read the value you would use
>> >>
>> >> KSVAR(test_value);
>> >>
>> >> so your function would look like something like this
>> >>
>> >> int get_test_value()
>> >> {
>> >>
>> >> =C2=A0 =C2=A0 =C2=A0return (*KSVAR(test_value));
>> >> }
>> >>
>> >> Then inside your process just call get_test_value() function as you u=
sually
>> >> do and you'll get a kernel written value without switching in kernel =
mode.
>> >>
>> >> Let's see now in more detail how that could be accomplished.
>> >> The shared variables will be accessed as normal variables and are rea=
d/write
>> >> inside the kernel. The variables need to be inside the same page(s) a=
nd nothing
>> >> but the shared variables (and the table) must be into the page(s). To
>> >> obtain that
>> >> I changed the linker script in this way
>> >>
>> >> --- a/sys/conf/ldscript.amd64
>> >> +++ b/sys/conf/ldscript.amd64
>> >> @@ -177,6 +177,15 @@ SECTIONS
>> >> =C2=A0 =C2=A0 *(.ldata .ldata.* .gnu.linkonce.l.*)
>> >> =C2=A0 =C2=A0 . =3D ALIGN(. !=3D 0 ? 64 / 8 : 1);
>> >> =C2=A0 }
>> >> + =C2=A0.ksvar ALIGN(CONSTANT (COMMONPAGESIZE)) :
>> >> + =C2=A0{
>> >> + =C2=A0 =C2=A0__ksvar_set_start =3D .;
>> >> + =C2=A0 =C2=A0*(.ksvar_table)
>> >> + =C2=A0 =C2=A0*(.ksvar)
>> >> +
>> >> + =C2=A0 . =3D ALIGN(CONSTANT (COMMONPAGESIZE));
>> >> + =C2=A0 __ksvar_set_stop =3D .;
>> >> + =C2=A0}
>> >> =C2=A0 . =3D ALIGN(64 / 8);
>> >> =C2=A0 _end =3D .; PROVIDE (end =3D .);
>> >> =C2=A0 . =3D DATA_SEGMENT_END (.);
>> >>
>> >> When we want to define a variable in the kernel to share with user sp=
ace
>> >> we have to use KSVAR_DEFINE macro in sys/sys/ksvar.h
>> >>
>> >> +struct ksvar_set {
>> >> + =C2=A0 =C2=A0 =C2=A0 uint32_t idx;
>> >> + =C2=A0 =C2=A0 =C2=A0 char *pksvar;
>> >> +};
>> >> +
>> >> +/*
>> >> + * Declare a variable into kernel shared linker_set.
>> >> + */
>> >> +#define =C2=A0 =C2=A0 =C2=A0 =C2=A0KSVAR_DEFINE(index, type, name) \
>> >> + =C2=A0 =C2=A0 =C2=A0 static type name __section(".ksvar"); =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 \
>> >> + =C2=A0 =C2=A0 =C2=A0 static struct ksvar_set name ## _ksvar_set =3D=
 { =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0\
>> >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 .idx =3D index, =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 \
>> >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 .pksvar =3D (char =
*) &name =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0\
>> >> + =C2=A0 =C2=A0 =C2=A0 }; =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0\
>> >> + =C2=A0 =C2=A0 =C2=A0 DATA_SET(ksvar_set, name ## _ksvar_set)
>> >>
>> >> Every variable must have a unique index. The indexes must
>> >> start from zero and be consecutive. When you add an index
>> >> you must bump the size of the table (KSVAR_TABLE_SIZE)
>> >> (see sys/sys/ksvar.h)
>> >>
>> >> The variables are inside the kernel static image that isn't managed
>> >> by the VM and so we need to allocate pages to map the physical addres=
ses.
>> >> A new SYSINIT (ksvarinit) will allocate a set of vm_page_t =C2=A0thro=
ugh
>> >> the vm_phys_fictitious_reg_range interface and fill the table using
>> >> the information
>> >> of the ksvar_set linker set, then will create a vm_object_t (vm_objec=
t_ksvar),
>> >> mark the fake pages as valid and put them into it.
>> >> When a new process is created by exec(3) the vm_object_ksvar will be
>> >> mapped read-only into the process address space by vm_map_fixed routi=
ne
>> >> just before mapping the user stack. The address of mapping will be re=
corded
>> >> inside the new p_ksvar field of the struct proc.
>> >> This field will be exported through a sysctl to the user space proces=
ses.
>> >> In order to implement syscalls as user space routines, we have to fin=
d out the
>> >> mapped address of the kernel shared variables when the libc is mapped=
 into
>> >> the process. So I added a function marked with the attribute construc=
tor.
>> >> It will called before any code into user process and before any code =
inside
>> >> the libc.
>> >>
>> >> +__attribute((constructor)) void init_kernel_shared()
>> >> +{
>> >> + =C2=A0 =C2=A0 =C2=A0 int mib[2];
>> >> + =C2=A0 =C2=A0 =C2=A0 size_t len;
>> >> + =C2=A0 =C2=A0 =C2=A0 vm_offset_t ksvar_address;
>> >> +
>> >> + =C2=A0 =C2=A0 =C2=A0 mib[0] =3D CTL_KERN;
>> >> + =C2=A0 =C2=A0 =C2=A0 mib[1] =3D KERN_KSVAR;
>> >> + =C2=A0 =C2=A0 =C2=A0 len =3D sizeof(vm_offset_t);
>> >> + =C2=A0 =C2=A0 =C2=A0 if (__sysctl(mib, 2, (void *) &ksvar_address, =
&len, NULL, 0) !=3D -1)
>> >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 ksvar_table =3D (u=
int32_t *) ksvar_address;
>> >> +}
>> >>
>> >> Once the libc knows the address of the table it can access to the sha=
red
>> >> variables.
>> >>
>> >> Just as proof of concept I re-implemented gettimeofday(3) in user spa=
ce.
>> >> First of all I didn't remove the entry into the syscall.master, just =
renamed the
>> >> sys_gettimeofday. I need it for the fallback path.
>> >> In the kernel I introduced a struct wall_clock.
>> >>
>> >> +struct wall_clock
>> >> +{
>> >> + =C2=A0 =C2=A0 =C2=A0 struct timeval =C2=A0tv;
>> >> + =C2=A0 =C2=A0 =C2=A0 struct timezone tz;
>> >> +};
>> >>
>> >> The struct is exported through sys/sys/time.h header.
>> >> I defined a new kernel shared variable. To do so I added an index in
>> >> sys/sys/ksvar.h
>> >> WALL_CLOCK_INDEX and bumped KSVAR_TABLE_SIZE to 1.
>> >> In the sys/kern/kern_clocksource.c
>> >>
>> >> +/* kernel shared variable for implmenting gettimeofday. */
>> >> +KSVAR_DEFINE(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);
>> >>
>> >> Now we defined a shared variable at index WALL_CLOCK_INDEX of type
>> >> struct wall_clock and named wall_clock.
>> >> Inside handleevents I update the info exported by wall_clock.
>> >>
>> >> + =C2=A0 =C2=A0 =C2=A0 struct timeval tv;
>> >> +
>> >> + =C2=A0 =C2=A0 =C2=A0 /* update time for userspace gettimeofday */
>> >> + =C2=A0 =C2=A0 =C2=A0 microtime(&tv);
>> >> + =C2=A0 =C2=A0 =C2=A0 wall_clock.tv =3D tv;
>> >> + =C2=A0 =C2=A0 =C2=A0 wall_clock.tz.tz_minuteswest =3D tz_minuteswes=
t;
>> >> + =C2=A0 =C2=A0 =C2=A0 wall_clock.tz.tz_dsttime =3D tz_dsttime;
>> >>
>> >> Now, in libc we import the shared variable
>> >>
>> >> +KSVAR_IMPORT(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);
>> >>
>> >> note that WALL_CLOCK_INDEX must be the same of the one defined
>> >> inside the kernel, and define a new function gettimeofday
>> >>
>> >> +int
>> >> +gettimeofday(struct timeval *tp, struct timezone *tzp)
>> >> +{
>> >> +
>> >> + =C2=A0 =C2=A0 =C2=A0 /* fallback to syscall if kernel doesn't expor=
t ksvar */
>> >> + =C2=A0 =C2=A0 =C2=A0 if (!KSVAR_IS_ACTIVE())
>> >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 return (sys_gettim=
eofday(tp, tzp));
>> >> +
>> >> + =C2=A0 =C2=A0 =C2=A0 if (tp !=3D NULL)
>> >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 *tp =3D KSVAR(wall=
_clock)->tv;
>> >> + =C2=A0 =C2=A0 =C2=A0 if (tzp !=3D NULL)
>> >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 *tzp =3D KSVAR(wal=
l_clock)->tz;
>> >> + =C2=A0 =C2=A0 =C2=A0 return (0);
>> >> +}
>> >>
>> >> Now when a process will call getimeofday, will call that function act=
ually.
>> >> If the process makes a lot of call to gettimeofday, we will see a
>> >> performance boost.
>> >> Note that if ksvar are not exported from the kernel (KSVAR_IS_ACTIVE)=
,
>> >> the function
>> >> fallback to call the actual syscall (sys_gettimeofday).
>> >>
>> >> Open tasks
>> >> - implement support for 32-bit emulated processes running in a 64-bit
>> >> environment.
>> >> - extend support to others arch
>> >> - implement more syscalls
>> >> - benchmarks
>> >> - Test, test, test.
>> >>
>> >> I'm looking forward to hear about your comments and suggestions.
>> >
>> > I very much dislike what you described, it makes ABI maintanence
>> > a nightmare.
>> > Below is some mail I wrote around Spring 2009, making some notes about
>> > desired proposal. This is what called vdso in Linux land.
>>
>> Did you bother to read at least Giovanni's description?
>> Because this has nothing to do with VDSO in Linux.
> Did you bothered to think shortly why do I object ?
>
>>
>> I think, he just wants to map in userland processes some pages from
>> the static image of the kernel (packed together in a specific
>> dataset). This imposes some non-trivial problem. The first thing is
>> that the static image is not thought to have physical pages tied to
>> it. The second is that he needs to make a clean design in order to let
>> consumer of this mechanism to correctly locate informations they want
>> within the shared page(s) and in the end read the correct values.
> Right, exactly, and this is why I object to the "offsets" approach.
> It basically moves us to the old times of the "jump tables" shared
> libraries, that fortunately was never a case for FreeBSD even when
> a.out was used.

I'm objecting to this either.

>>
>> I have some reservations on both the implementation and the approach
>> for retrieving datas from the page.
>> In particular, I don't like that a new vm_object is allocated for this
>> page. What I really would like would be:
>> 1) very minimal implementation -- you just use
>> pmap_enter()/pmap_remove() specifically when needed, separately, in
>> fork(), execve(), etc. cases
> Oh, this simply cannot work.

And why? Assuming you provide a vm_page_t from an UMA zone just like
fakepage do. Of course you cannot recycle for this purpose any page
caming from vm_page_alloc().

>> 2) more complete approach -- you make a very quick layer which let you
>> map pages from the static image of the kernel and the shared page
>> becomes just a specific consumer of this. This way the object has much
>> more sense because it becomes an object associated to all the static
>> image of the kernel
> So you want to circumvent the vm layer.

Note sure I agree with your opinion on this.

>>
>> About the layering, I don't like that you require both a kernel and
>> userland header to locate the objects within the page. This is very
>> likely ABI breakage prone. It is needed a mechanism for retrieving at
>> run time what Giovanni calls "indexes", or making it indexes-agnostic.
>
> And this is what VDSO is for. VDSO with the standard ELF symbol
> interposition rules allow to have libc that is completely unaware of the
> shared page and 'indexes', i.e. which works both for older kernel that
> do not export required index, and for new kernels that export the same
> information in some more advanced format. By having VDSO that exports
> e.g. gettimeofday() we would get override for libc gettimeofday, while
> having fully functional libc for other, future and past, kernels, even
> if the format of the data exported for super-fast gettimeofday changes.
>
> The tight between VDSO and kernel is not a problem, since VDSO is part
> of the kernel from the deployment POV. More. either existing ELF
> linker in kernel, or some trivial modifications to it, would allow
> to not use 'indexes' on the kernel side too.

I admit I don't have a better plan on how to retrieve objects from the
shared page at the moment, I didn't give much thought to it.

> We already have a shared page between kernel and whole set of the same-AB=
I
> processes. Currently it is used for signal trampolines only.
> The hard parts of the task is to provide VDSO build glue. Also IMO the
> hard task is to define sensible gettimeofday() implementation, probably
> using rdtsc in usermode. Shared page is easy, or at least it is already
> there without ugly and non-working vm hacks.
>
> As an additional note, already put by Bruce, the implementation of
> usermode gettimeofday is exactly opposite of any reasonable implementatio=
n.
> It looses the precision to the frequency of the event timer. Obvious
> approach is to not have any periodically updating data for gettimeofday
> purpose, and use some formula with rdtsc and kernel-provided coefficients
> on the machines where rdtsc is usable.

The gettimeofday() implementation is a different story than what is asked h=
ere.

> Interesting question is how much shared the shared page needs be.
> Obvious needs are shared between all same-ABI processes, but I can also
> easily see a need for the per-process private information be present in
> the 'private-shared' page. For silly but typical example, useful for
> moronix-style benchmarks, see getpid().

Really the performance benefits of having fast getpid() is marginal if
compared to heavilly used things like gettimeofday(). I cannot think
of a per-process page implementing a fast syscall that can bring many
perfomance advantages.

Attilio


--
Peace can only be achieved by understanding - A. Einstein


--=20
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-arch@FreeBSD.ORG  Sat Jun  2 17:16:37 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 9D316106564A;
	Sat,  2 Jun 2012 17:16:37 +0000 (UTC)
	(envelope-from kostikbel@gmail.com)
Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200])
	by mx1.freebsd.org (Postfix) with ESMTP id 083FF8FC1B;
	Sat,  2 Jun 2012 17:16:36 +0000 (UTC)
Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1])
	by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q52HGXPx066108;
	Sat, 2 Jun 2012 20:16:33 +0300 (EEST)
	(envelope-from kostikbel@gmail.com)
Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1])
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id
	q52HGWLB075163; Sat, 2 Jun 2012 20:16:32 +0300 (EEST)
	(envelope-from kostikbel@gmail.com)
Received: (from kostik@localhost)
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q52HGWrW075162; 
	Sat, 2 Jun 2012 20:16:32 +0300 (EEST)
	(envelope-from kostikbel@gmail.com)
X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to
	kostikbel@gmail.com using -f
Date: Sat, 2 Jun 2012 20:16:32 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Attilio Rao <attilio@freebsd.org>
Message-ID: <20120602171632.GC2358@deviant.kiev.zoral.com.ua>
References: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>
	<20120601193522.GA2358@deviant.kiev.zoral.com.ua>
	<CAJ-FndC71=3Jo+BxQi==gCoLipBxj8X8XMBydjvrcKeGw+WOnA@mail.gmail.com>
	<20120602164847.GB2358@deviant.kiev.zoral.com.ua>
	<CAJ-FndAXFwuEspq+QeF0Hv1dr8JjREP=c=g3-abP=eoZ-D4hEg@mail.gmail.com>
	<CAJ-FndCpztSWyJo2hRVs5qu+vQOj9E1mPBhfVOxM_OC2eNac6A@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="0CHKT3anvf6u5QiQ"
Content-Disposition: inline
In-Reply-To: <CAJ-FndCpztSWyJo2hRVs5qu+vQOj9E1mPBhfVOxM_OC2eNac6A@mail.gmail.com>
User-Agent: Mutt/1.4.2.3i
X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00
	autolearn=ham version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on
	skuns.kiev.zoral.com.ua
Cc: Alexander Kabaev <kan@freebsd.org>, Alan Cox <alc@rice.edu>,
	Konstantin Belousov <kib@freebsd.org>,
	Gianni <gianni@freebsd.org>, freebsd-arch@freebsd.org
Subject: Re: Fwd: [RFC] Kernel shared variables
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 Jun 2012 17:16:37 -0000


--0CHKT3anvf6u5QiQ
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sat, Jun 02, 2012 at 06:00:06PM +0100, Attilio Rao wrote:
> Sorry, resending with all the recipients in.
>=20
> Attilio
>=20
>=20
> ---------- Forwarded message ----------
> From: Attilio Rao <attilio@freebsd.org>
> Date: 2012/6/2
> Subject: Re: [RFC] Kernel shared variables
> To: Konstantin Belousov <kostikbel@gmail.com>
>=20
>=20
> 2012/6/2 Konstantin Belousov <kostikbel@gmail.com>:
> > On Sat, Jun 02, 2012 at 02:01:35PM +0100, Attilio Rao wrote:
[Tried to trim the text]

> >> I think, he just wants to map in userland processes some pages from
> >> the static image of the kernel (packed together in a specific
> >> dataset). This imposes some non-trivial problem. The first thing is
> >> that the static image is not thought to have physical pages tied to
> >> it. The second is that he needs to make a clean design in order to let
> >> consumer of this mechanism to correctly locate informations they want
> >> within the shared page(s) and in the end read the correct values.
> > Right, exactly, and this is why I object to the "offsets" approach.
> > It basically moves us to the old times of the "jump tables" shared
> > libraries, that fortunately was never a case for FreeBSD even when
> > a.out was used.
>=20
> I'm objecting to this either.
My english is not good enough to understand this. Do you agree or disagree
with my statement that 'indexes' make it very hard to maintain ABI ?

>=20
> >>
> >> I have some reservations on both the implementation and the approach
> >> for retrieving datas from the page.
> >> In particular, I don't like that a new vm_object is allocated for this
> >> page. What I really would like would be:
> >> 1) very minimal implementation -- you just use
> >> pmap_enter()/pmap_remove() specifically when needed, separately, in
> >> fork(), execve(), etc. cases
> > Oh, this simply cannot work.
>=20
> And why? Assuming you provide a vm_page_t from an UMA zone just like
> fakepage do. Of course you cannot recycle for this purpose any page
> caming from vm_page_alloc().
Due to pv_collect/pmap_pv_reclaim, the pte might be destroyed any time.

Using hacks like mapping the page wired and then needing to hack
any VM space manipulation (fork/rfork/exec/exit/swapout/I possibly
missed several cases) just does not pay for it.

>=20
> >> 2) more complete approach -- you make a very quick layer which let you
> >> map pages from the static image of the kernel and the shared page
> >> becomes just a specific consumer of this. This way the object has much
> >> more sense because it becomes an object associated to all the static
> >> image of the kernel
> > So you want to circumvent the vm layer.
>=20
> Note sure I agree with your opinion on this.
>=20
> >>
> >> About the layering, I don't like that you require both a kernel and
> >> userland header to locate the objects within the page. This is very
> >> likely ABI breakage prone. It is needed a mechanism for retrieving at
> >> run time what Giovanni calls "indexes", or making it indexes-agnostic.
> >
> > And this is what VDSO is for. VDSO with the standard ELF symbol
> > interposition rules allow to have libc that is completely unaware of the
> > shared page and 'indexes', i.e. which works both for older kernel that
> > do not export required index, and for new kernels that export the same
> > information in some more advanced format. By having VDSO that exports
> > e.g. gettimeofday() we would get override for libc gettimeofday, while
> > having fully functional libc for other, future and past, kernels, even
> > if the format of the data exported for super-fast gettimeofday changes.
> >
> > The tight between VDSO and kernel is not a problem, since VDSO is part
> > of the kernel from the deployment POV. More. either existing ELF
> > linker in kernel, or some trivial modifications to it, would allow
> > to not use 'indexes' on the kernel side too.
>=20
> I admit I don't have a better plan on how to retrieve objects from the
> shared page at the moment, I didn't give much thought to it.
>=20
> > We already have a shared page between kernel and whole set of the same-=
ABI
> > processes. Currently it is used for signal trampolines only.
> > The hard parts of the task is to provide VDSO build glue. Also IMO the
> > hard task is to define sensible gettimeofday() implementation, probably
> > using rdtsc in usermode. Shared page is easy, or at least it is already
> > there without ugly and non-working vm hacks.
> >
> > As an additional note, already put by Bruce, the implementation of
> > usermode gettimeofday is exactly opposite of any reasonable implementat=
ion.
> > It looses the precision to the frequency of the event timer. Obvious
> > approach is to not have any periodically updating data for gettimeofday
> > purpose, and use some formula with rdtsc and kernel-provided coefficien=
ts
> > on the machines where rdtsc is usable.
>=20
> The gettimeofday() implementation is a different story than what is asked=
 here.

But the goal is to have fast clocks, right ? What else is planned ?

In fact, I think that if the whole goal is only fast clocks, then we
do not need any additional system mechanisms, since we can easily export
coefficients for rdtsc formula already. E.g. we can put it into elf auxv,
which is ugly but bearable.

>=20
> > Interesting question is how much shared the shared page needs be.
> > Obvious needs are shared between all same-ABI processes, but I can also
> > easily see a need for the per-process private information be present in
> > the 'private-shared' page. For silly but typical example, useful for
> > moronix-style benchmarks, see getpid().
>=20
> Really the performance benefits of having fast getpid() is marginal if
> compared to heavilly used things like gettimeofday(). I cannot think
> of a per-process page implementing a fast syscall that can bring many
> perfomance advantages.

This is completely true, but there may be other process-private data that
could benefit from the low access cost. I just do not know right now.

--0CHKT3anvf6u5QiQ
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (FreeBSD)

iEYEARECAAYFAk/KSnAACgkQC3+MBN1Mb4itVACg2xwUF4QRdToJDtqPRvRqaVUT
AxwAoIx9JO6bedN2XFgQPWc/EqcAHFvv
=sqUF
-----END PGP SIGNATURE-----

--0CHKT3anvf6u5QiQ--

From owner-freebsd-arch@FreeBSD.ORG  Sat Jun  2 17:28:00 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id D9FF91065677;
	Sat,  2 Jun 2012 17:28:00 +0000 (UTC)
	(envelope-from asmrookie@gmail.com)
Received: from mail-lb0-f182.google.com (mail-lb0-f182.google.com
	[209.85.217.182])
	by mx1.freebsd.org (Postfix) with ESMTP id B96738FC25;
	Sat,  2 Jun 2012 17:27:59 +0000 (UTC)
Received: by lbon10 with SMTP id n10so2985176lbo.13
	for <multiple recipients>; Sat, 02 Jun 2012 10:27:58 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:cc:content-type;
	bh=Kt/O+PHHOpLhrbzodt5P7CT7LFTo8FnmSBDFttgcJ+Q=;
	b=ExKpAs160ZiDxSTWwl+bnQ6GlYzQUZii4nC6ISVvjIPZiUIGkFgx5RYvrZ/47oSwxX
	BcV6pTut9G8A06AG/px5pq0WaB1CozoIeiHs3B+w88+paZVX4XLt4DoOS3P/LC8t/UZW
	Ldjj57gN2hd3LkolH4tc13lkqR40JAH7Z1clmo9rGkDKOoDjfj5e2ega496ivOuWsRMI
	ubdBng/qjibr5I+7qhNTHxp7do8JYstDIrkUJNNPEEvlm7EA/dohuEOKTmtlbVpve6wV
	GSb4LCs24lC7TVft1T9MLOhH7X0DuIjNpQp4fBpdKk98iT0VmeAXW/Itgh4phR6qpoOa
	NpGw==
MIME-Version: 1.0
Received: by 10.152.131.9 with SMTP id oi9mr6965358lab.39.1338658078575; Sat,
	02 Jun 2012 10:27:58 -0700 (PDT)
Sender: asmrookie@gmail.com
Received: by 10.112.27.65 with HTTP; Sat, 2 Jun 2012 10:27:58 -0700 (PDT)
In-Reply-To: <20120602171632.GC2358@deviant.kiev.zoral.com.ua>
References: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>
	<20120601193522.GA2358@deviant.kiev.zoral.com.ua>
	<CAJ-FndC71=3Jo+BxQi==gCoLipBxj8X8XMBydjvrcKeGw+WOnA@mail.gmail.com>
	<20120602164847.GB2358@deviant.kiev.zoral.com.ua>
	<CAJ-FndAXFwuEspq+QeF0Hv1dr8JjREP=c=g3-abP=eoZ-D4hEg@mail.gmail.com>
	<CAJ-FndCpztSWyJo2hRVs5qu+vQOj9E1mPBhfVOxM_OC2eNac6A@mail.gmail.com>
	<20120602171632.GC2358@deviant.kiev.zoral.com.ua>
Date: Sat, 2 Jun 2012 18:27:58 +0100
X-Google-Sender-Auth: 9_oCPwwfNLdlvGTKGIF6BSa9YTA
Message-ID: <CAJ-FndCh77syp+860LaCbgQ6eiQAq_OMM98RxqxmCv+YKENXoA@mail.gmail.com>
From: Attilio Rao <attilio@freebsd.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Content-Type: text/plain; charset=UTF-8
Cc: Alexander Kabaev <kan@freebsd.org>, Alan Cox <alc@rice.edu>,
	Konstantin Belousov <kib@freebsd.org>,
	Gianni <gianni@freebsd.org>, freebsd-arch@freebsd.org
Subject: Re: Fwd: [RFC] Kernel shared variables
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 Jun 2012 17:28:00 -0000

2012/6/2 Konstantin Belousov <kostikbel@gmail.com>:
> On Sat, Jun 02, 2012 at 06:00:06PM +0100, Attilio Rao wrote:
>> Sorry, resending with all the recipients in.
>>
>> Attilio
>>
>>
>> ---------- Forwarded message ----------
>> From: Attilio Rao <attilio@freebsd.org>
>> Date: 2012/6/2
>> Subject: Re: [RFC] Kernel shared variables
>> To: Konstantin Belousov <kostikbel@gmail.com>
>>
>>
>> 2012/6/2 Konstantin Belousov <kostikbel@gmail.com>:
>> > On Sat, Jun 02, 2012 at 02:01:35PM +0100, Attilio Rao wrote:
> [Tried to trim the text]
>
>> >> I think, he just wants to map in userland processes some pages from
>> >> the static image of the kernel (packed together in a specific
>> >> dataset). This imposes some non-trivial problem. The first thing is
>> >> that the static image is not thought to have physical pages tied to
>> >> it. The second is that he needs to make a clean design in order to let
>> >> consumer of this mechanism to correctly locate informations they want
>> >> within the shared page(s) and in the end read the correct values.
>> > Right, exactly, and this is why I object to the "offsets" approach.
>> > It basically moves us to the old times of the "jump tables" shared
>> > libraries, that fortunately was never a case for FreeBSD even when
>> > a.out was used.
>>
>> I'm objecting to this either.
> My english is not good enough to understand this. Do you agree or disagree
> with my statement that 'indexes' make it very hard to maintain ABI ?

I agree with you. The offset approach just doesn't work clean on an
ABI perspective.

>> >>
>> >> I have some reservations on both the implementation and the approach
>> >> for retrieving datas from the page.
>> >> In particular, I don't like that a new vm_object is allocated for this
>> >> page. What I really would like would be:
>> >> 1) very minimal implementation -- you just use
>> >> pmap_enter()/pmap_remove() specifically when needed, separately, in
>> >> fork(), execve(), etc. cases
>> > Oh, this simply cannot work.
>>
>> And why? Assuming you provide a vm_page_t from an UMA zone just like
>> fakepage do. Of course you cannot recycle for this purpose any page
>> caming from vm_page_alloc().
> Due to pv_collect/pmap_pv_reclaim, the pte might be destroyed any time.
>
> Using hacks like mapping the page wired and then needing to hack
> any VM space manipulation (fork/rfork/exec/exit/swapout/I possibly
> missed several cases) just does not pay for it.

Well my take was to map the page wired because of the nature of the
workload too (static image -- present in memory -- wired page).


[ trim ]

>> The gettimeofday() implementation is a different story than what is asked here.
>
> But the goal is to have fast clocks, right ? What else is planned ?
>
> In fact, I think that if the whole goal is only fast clocks, then we
> do not need any additional system mechanisms, since we can easily export
> coefficients for rdtsc formula already. E.g. we can put it into elf auxv,
> which is ugly but bearable.

Not sure if there is anything else besides gettimeofday() that we want
right now, in particular on global basis.
I just mean to say that I don't think Giovanni put a lot of effort in
correctness/robustness of gettimeofday userland implementation, so we
should not judge that part of the patch too tightly.

>> > Interesting question is how much shared the shared page needs be.
>> > Obvious needs are shared between all same-ABI processes, but I can also
>> > easily see a need for the per-process private information be present in
>> > the 'private-shared' page. For silly but typical example, useful for
>> > moronix-style benchmarks, see getpid().
>>
>> Really the performance benefits of having fast getpid() is marginal if
>> compared to heavilly used things like gettimeofday(). I cannot think
>> of a per-process page implementing a fast syscall that can bring many
>> perfomance advantages.
>
> This is completely true, but there may be other process-private data that
> could benefit from the low access cost. I just do not know right now.

I don't know either, thus I don't think there is a big urgence for
per-process shared pages at all.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-arch@FreeBSD.ORG  Sat Jun  2 20:05:21 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 60D2C1065672;
	Sat,  2 Jun 2012 20:05:21 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail35.syd.optusnet.com.au (mail35.syd.optusnet.com.au
	[211.29.133.51])
	by mx1.freebsd.org (Postfix) with ESMTP id E47598FC12;
	Sat,  2 Jun 2012 20:05:20 +0000 (UTC)
Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au
	(c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232])
	by mail35.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q52K5Atd015942
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sun, 3 Jun 2012 06:05:11 +1000
Date: Sun, 3 Jun 2012 06:05:10 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Konstantin Belousov <kostikbel@gmail.com>
In-Reply-To: <20120602164847.GB2358@deviant.kiev.zoral.com.ua>
Message-ID: <20120603053445.Y3302@besplex.bde.org>
References: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>
	<20120601193522.GA2358@deviant.kiev.zoral.com.ua>
	<CAJ-FndC71=3Jo+BxQi==gCoLipBxj8X8XMBydjvrcKeGw+WOnA@mail.gmail.com>
	<20120602164847.GB2358@deviant.kiev.zoral.com.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Attilio Rao <attilio@FreeBSD.org>, alc@FreeBSD.org,
	Giovanni Trematerra <giovanni.trematerra@gmail.com>,
	Alexander Kabaev <kan@FreeBSD.org>, freebsd-arch@FreeBSD.org
Subject: Re: [RFC] Kernel shared variables
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 Jun 2012 20:05:21 -0000

On Sat, 2 Jun 2012, Konstantin Belousov wrote:

> On Sat, Jun 02, 2012 at 02:01:35PM +0100, Attilio Rao wrote:
>> ...
>> I have some reservations on both the implementation and the approach
>> for retrieving datas from the page.
>> In particular, I don't like that a new vm_object is allocated for this
>> page. What I really would like would be:
>> 1) very minimal implementation -- you just use
>> pmap_enter()/pmap_remove() specifically when needed, separately, in
>> fork(), execve(), etc. cases
> Oh, this simply cannot work.
>
>> 2) more complete approach -- you make a very quick layer which let you
>> map pages from the static image of the kernel and the shared page
>> becomes just a specific consumer of this. This way the object has much
>> more sense because it becomes an object associated to all the static
>> image of the kernel
> So you want to circumvent the vm layer.
>>
>> About the layering, I don't like that you require both a kernel and
>> userland header to locate the objects within the page. This is very
>> likely ABI breakage prone. It is needed a mechanism for retrieving at
>> run time what Giovanni calls "indexes", or making it indexes-agnostic.
>
> And this is what VDSO is for. VDSO with the standard ELF symbol
> interposition rules allow to have libc that is completely unaware of the
> shared page and 'indexes', i.e. which works both for older kernel that
> do not export required index, and for new kernels that export the same
> information in some more advanced format. By having VDSO that exports

I have no strong ideas about the ABI issues.  Even shared libraries are
too large and complicated for me :-).

> e.g. gettimeofday() we would get override for libc gettimeofday, while
> having fully functional libc for other, future and past, kernels, even
> if the format of the data exported for super-fast gettimeofday changes.

Please no getttimeofday() for the example :-).

> As an additional note, already put by Bruce, the implementation of
> usermode gettimeofday is exactly opposite of any reasonable implementation.
> It looses the precision to the frequency of the event timer. Obvious
> approach is to not have any periodically updating data for gettimeofday
> purpose, and use some formula with rdtsc and kernel-provided coefficients
> on the machines where rdtsc is usable.

Actually, you can probably do gettimeofday() by exporting mounds of
excecute-only and read-only kernel code and data in the in the shared
page(s).  The kernel code becomes just another way of implementing a
shared library that is especially good for syscalls.  It needs to run
with only user privilege.  x86 rdtsc normally has user privilege.  User
privilege for timecounter hardware in bus space would be problematic.
Actually^2, you only need a small amount of kernel code for this --
just microtime() and what it calls, with only the timecounter hardware
call being a problem.  The kernel maintains lots of not-quite-constant
timecounter state (primarily timehands offsets) that can be locked in
the time domain in the same way that it is in the kernel.

> Interesting question is how much shared the shared page needs be.
> Obvious needs are shared between all same-ABI processes, but I can also
> easily see a need for the per-process private information be present in
> the 'private-shared' page. For silly but typical example, useful for
> moronix-style benchmarks, see getpid().

Slightly better benchmarks use getppid() since the parent pid is not
quite constant so it can't easily be cached in userland.  But with
a kernel read-only pages, it it doesn't even need time domain locking,
since getppid() is inherently racy (the parent may go away) before it
returns.

Lots of read-only syscalls that don't require privilege or much locking
could be implemented similarly.  All syscalls can be put in the shared
executable page(s), with most reducing to the same library code as now
to actually enter the kernel.  This is too large and complicated for me.

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Sat Jun  2 21:28:16 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id C6289106566B;
	Sat,  2 Jun 2012 21:28:16 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail04.syd.optusnet.com.au (mail04.syd.optusnet.com.au
	[211.29.132.185])
	by mx1.freebsd.org (Postfix) with ESMTP id 30ED98FC1E;
	Sat,  2 Jun 2012 21:28:13 +0000 (UTC)
Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au
	(c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232])
	by mail04.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q52LS958002626
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sun, 3 Jun 2012 07:28:10 +1000
Date: Sun, 3 Jun 2012 07:28:09 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Konstantin Belousov <kostikbel@gmail.com>
In-Reply-To: <20120602171632.GC2358@deviant.kiev.zoral.com.ua>
Message-ID: <20120603063330.H3418@besplex.bde.org>
References: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>
	<20120601193522.GA2358@deviant.kiev.zoral.com.ua>
	<CAJ-FndC71=3Jo+BxQi==gCoLipBxj8X8XMBydjvrcKeGw+WOnA@mail.gmail.com>
	<20120602164847.GB2358@deviant.kiev.zoral.com.ua>
	<CAJ-FndAXFwuEspq+QeF0Hv1dr8JjREP=c=g3-abP=eoZ-D4hEg@mail.gmail.com>
	<CAJ-FndCpztSWyJo2hRVs5qu+vQOj9E1mPBhfVOxM_OC2eNac6A@mail.gmail.com>
	<20120602171632.GC2358@deviant.kiev.zoral.com.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Gianni <gianni@FreeBSD.org>, Alan Cox <alc@rice.edu>,
	Alexander Kabaev <kan@FreeBSD.org>, Attilio Rao <attilio@FreeBSD.org>,
	Konstantin Belousov <kib@FreeBSD.org>, freebsd-arch@FreeBSD.org
Subject: Re: Fwd: [RFC] Kernel shared variables
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 Jun 2012 21:28:16 -0000

On Sat, 2 Jun 2012, Konstantin Belousov wrote:

> On Sat, Jun 02, 2012 at 06:00:06PM +0100, Attilio Rao wrote:
>> ...
>> 2012/6/2 Konstantin Belousov <kostikbel@gmail.com>:
>>> On Sat, Jun 02, 2012 at 02:01:35PM +0100, Attilio Rao wrote:
> [Tried to trim the text]

[Trimmed more]

>>> Right, exactly, and this is why I object to the "offsets" approach.
>>> It basically moves us to the old times of the "jump tables" shared
>>> libraries, that fortunately was never a case for FreeBSD even when
>>> a.out was used.
>>
>> I'm objecting to this either.
> My english is not good enough to understand this. Do you agree or disagree
> with my statement that 'indexes' make it very hard to maintain ABI ?

Syscall numbers are basically indexes, and work OK (because there aren't
many of them even after ~30-35 years of accumulating them).

> ...
>> The gettimeofday() implementation is a different story than what is asked here.
>
> But the goal is to have fast clocks, right ? What else is planned ?
>
> In fact, I think that if the whole goal is only fast clocks, then we
> do not need any additional system mechanisms, since we can easily export
> coefficients for rdtsc formula already. E.g. we can put it into elf auxv,
> which is ugly but bearable.

How do you get the timehands offsets?  These only need to be updated
every second or so, or when used, but how can the application know
when they need to be updated if this is not done automatically in the
kernel by writing to a shared page?  I can only think of the
application arranging an alarm signal every second or so and updating
then.  No good for libraries.

rdtsc is also very unportable, even on CPUs that have it.  But all other
x86 timecounter hardware is too slow if you want gettimeofday() to be fast
and as accurate as it is now.

Bruce