From owner-freebsd-arch@FreeBSD.ORG Sun May 27 16:22:12 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 0AA2F1065672 for ; Sun, 27 May 2012 16:22:12 +0000 (UTC) (envelope-from rmh.aybabtu@gmail.com) Received: from mail-ob0-f182.google.com (mail-ob0-f182.google.com [209.85.214.182]) by mx1.freebsd.org (Postfix) with ESMTP id C361C8FC19 for ; Sun, 27 May 2012 16:22:11 +0000 (UTC) Received: by obcni5 with SMTP id ni5so5488833obc.13 for ; Sun, 27 May 2012 09:22:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=2T4M2yrOSJB4ztn5XRBiqwoJgEhsYzJ4rbeK6W7Rbl4=; b=AXJb4ux4hA7pq+pbGXPssQ7zwQ3fne9MiN4Dis3JYxgF6KUw9ZpmYF3ajeFshEf62x JshPco9ZvprtQch60RaQE/ED27skEMhGGkzCl0E6JSuryhhyLNvEmrP8OkY3md9is1wz /L/qCWEtDZGzezYXBNcdE0TtRieeVTZWK+bbPyfR8qqs/elzBe2W3FCqDgBO8DMtDlg1 Q33DFs4qz2UGT7RpC/zmqkZUrKMxzG6kbpeYD8cOK69xMUiU4aVoTTM7m7S4ZsNV6phk 5qLuTcxWVwbBdS7MteVU8j4wP1jzt39JwRBDfGlZY9N3OkXydnIq6HhKQQKPhx+I8NIm DT4g== MIME-Version: 1.0 Received: by 10.50.149.129 with SMTP id ua1mr2776036igb.43.1338135731004; Sun, 27 May 2012 09:22:11 -0700 (PDT) Sender: rmh.aybabtu@gmail.com Received: by 10.42.202.84 with HTTP; Sun, 27 May 2012 09:22:10 -0700 (PDT) In-Reply-To: <20120519134005.GJ2358@deviant.kiev.zoral.com.ua> References: <20120519134005.GJ2358@deviant.kiev.zoral.com.ua> Date: Sun, 27 May 2012 18:22:10 +0200 X-Google-Sender-Auth: vlHsJ8bKVAiH4YZpZsNV5zuZeSc Message-ID: From: Robert Millan To: Konstantin Belousov Content-Type: text/plain; charset=UTF-8 Cc: freebsd-arch@freebsd.org Subject: Re: headers that use "struct bintime" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 May 2012 16:22:12 -0000 2012/5/19 Konstantin Belousov : >> sys/arm/include/cpu.h >> sys/dev/iscsi/initiator/iscsivar.h >> sys/geom/journal/g_journal.h >> sys/sys/dtrace_bsd.h >> sys/sys/devicestat.h >> sys/sys/timeet.h >> sys/sys/bio.h >> sys/opencrypto/cryptodev.h >> > Note that all headers you listed are kernel headers, and kernel is exposed > to the whole namespace. I suspect that no headers are supposed to be used > by usermode among the list. There's at least one case (sys/devicestat.h) which is widely exposed to userland: lib/libdevstat/devstat.h:#include lib/libgeom/geom_stats.c:#include usr.bin/kdump/ioctl.c:#include sbin/mdconfig/mdconfig.c:#include and also into sys/cam/, some of which is in userland too (built into libcam): sys/cam/ata/ata_pmp.c:#include sys/cam/ata/ata_da.c:#include sys/cam/scsi/scsi_pt.c:#include sys/cam/scsi/scsi_pass.c:#include sys/cam/scsi/scsi_targ_bh.c:#include sys/cam/scsi/scsi_sa.c:#include sys/cam/scsi/scsi_da.c:#include sys/cam/scsi/scsi_target.c:#include sys/cam/scsi/scsi_sg.c:#include sys/cam/scsi/scsi_cd.c:#include sys/cam/cam_periph.c:#include -- Robert Millan From owner-freebsd-arch@FreeBSD.ORG Sun May 27 18:01:08 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 7B2291065677; Sun, 27 May 2012 18:01:08 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail07.syd.optusnet.com.au (mail07.syd.optusnet.com.au [211.29.132.188]) by mx1.freebsd.org (Postfix) with ESMTP id EB0708FC28; Sun, 27 May 2012 18:01:07 +0000 (UTC) Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail07.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q4RI0x4E019202 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 28 May 2012 04:01:00 +1000 Date: Mon, 28 May 2012 04:00:59 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Robert Millan In-Reply-To: Message-ID: <20120528023818.F2417@besplex.bde.org> References: <20120519134005.GJ2358@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Konstantin Belousov , freebsd-arch@FreeBSD.org Subject: Re: headers that use "struct bintime" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 May 2012 18:01:08 -0000 On Sun, 27 May 2012, Robert Millan wrote: > 2012/5/19 Konstantin Belousov : >>> sys/arm/include/cpu.h >>> sys/dev/iscsi/initiator/iscsivar.h >>> sys/geom/journal/g_journal.h >>> sys/sys/dtrace_bsd.h >>> sys/sys/devicestat.h >>> sys/sys/timeet.h >>> sys/sys/bio.h >>> sys/opencrypto/cryptodev.h >>> >> Note that all headers you listed are kernel headers, and kernel is exposed >> to the whole namespace. I suspect that no headers are supposed to be used >> by usermode among the list. > > There's at least one case (sys/devicestat.h) which is widely exposed > to userland: devstat has silly APIs, with long double values (despite needing the range and precision of long doubles over doubles less than most things) and bintimes (despite needing the precision of bintimes over less than most things), but is well established so is hard to fix now. > lib/libdevstat/devstat.h:#include > lib/libgeom/geom_stats.c:#include > usr.bin/kdump/ioctl.c:#include devicestat.h doesn't even have any ioctls in it, but it is apparently needed here to supply pollution in other headers. Perhaps it is no longer even needed here. In FreeBSD-4, it was needed for the include of , which does have ioctls in it and also has a complete struct devstat in its softc. Headers with softcs in them never belonged in , and softcs never belonged in public APIs, and at least the part of this has been fixed -- no longer exists, and the only struct devstats in are an incomplete one in bio.h and of course the full one in devicestat.h. ccd was from NetBSD, and has gone away completely. FreeBSD used mainly vn at first, then md. I forget if you need bintimes to work when !__BSD_VISIBLE, or just need the headers to be self-sufficient. devicestat.h already has the latter. It includes sys/time.h. The early include of sys/time.h in kdump/ioctl.c has no effect, since it is after the include of sys/devicestat.h where sys/time.h has already been included nested. OTOH, in the kernel the include of sys/time.h in sys/devicestat.h normally has no effect, since normally sys/param.h is included earlier and it supplies sys/time.h as standard pollution. > sbin/mdconfig/mdconfig.c:#include /dev/md's softc is ugly and includes a struct devstat, but its ugliness is not public (it is only in dev/md/md.c). Otherwise it would give the same problem as ccd used to. > and also into sys/cam/, some of which is in userland too (built into libcam): > > sys/cam/ata/ata_pmp.c:#include > sys/cam/ata/ata_da.c:#include > sys/cam/scsi/scsi_pt.c:#include > sys/cam/scsi/scsi_pass.c:#include > sys/cam/scsi/scsi_targ_bh.c:#include > sys/cam/scsi/scsi_sa.c:#include > sys/cam/scsi/scsi_da.c:#include > sys/cam/scsi/scsi_target.c:#include > sys/cam/scsi/scsi_sg.c:#include > sys/cam/scsi/scsi_cd.c:#include I wonder why scsi is still doing so much with devstat. For disks, devstat handling mostly moved into geom. In the old ata driver, there are no remaining reference to struct devstat or devicestat.h in any disk driver. This works for ad, but IIRC device statistics were broken for a long time for acd. Non-disk drivers that don't go through geom must still do their own devstat handling. Scsi drivers always did, and the above shows them still doing it, but ata_da, da and cd are disk drivers so why do they do it? Old ata drivers mostly didn't, so device statistics never worked for them. Now, only the ata tape driver does its own device statistics, as it always did. All of the above are .c files and all of the struct devstats in their softc's are private, so they don't affect userland. The struct devstats in the above are: scsi/scsi_ch.c: struct devstat *device_stats; scsi/scsi_pass.c: struct devstat *device_stats; scsi/scsi_pt.c: struct devstat *device_stats; scsi/scsi_sa.c: struct devstat *device_stats; scsi/scsi_sg.c: struct devstat *device_stats; scsi/scsi_targ_bh.c: struct devstat device_stats; scsi/scsi_target.c: struct devstat device_stats; The cam/ata files don't use any devstat functions, so the includes of devicestat.h in them are apparently bogus. cd and sa do use devstat functions, so the includes of devicestat.h in them are needed, but they don't appear in the previous list because they put the devstat struct in a general disk/geom struct instead of in their softc. scsi_ch.c includes devicestat.h and needs it but is not in your list. This and the others not already described must do their own devstat handling since they are not disks. All except the targ* ones only use an incomplete struct device_stats (a pointer to a full one). For disk drivers, this indirection helps avoid polluting public disk headers with the complete declaration. It might not be useful here, but all the non-targ* scsi drivers do it. In FreeBSD-4 (before geom), _all_ scsi drivers used a complete device_stats struct in their softc. Oops, I forgot the libcam use of kernel files. It seems to use only scsi_da.c and scsi_sa.c from the above. Both of these use only an indirect struct device_stats in their softc. They can't reasonably access this in userland, and it is clear that scsi_da.c tries not to since only includes devicestat.h under a _KERNEL ifdef. scsi_sa.c is not so careful. I checked this in the userland .depend. devicestat.h is only depended on by scsi_sa.c. Hopefully it can be fixed in the same way as scsi_da.c was (I guess the latter does less with devstat because more is done in geom). Thus there seems to be no real need for devicestat.h from scsi files userland. The public interface for libdevstat is . This shouldn't export struct device_stat, but it includes and struct device_stat isn't in the _KERNEL section there. Even the libdevstat implementation never references struct device_stat, so it seems to be pure pollution in libdevstat. I think libdevstat just uses a sysctl that converts kernel struct devicestats into userland struct devstat, so userland should never see the former. However, bintimes are part of the public API (they are in struct devstat and a couple of functions). Bruce From owner-freebsd-arch@FreeBSD.ORG Mon May 28 06:36:12 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 83699106564A; Mon, 28 May 2012 06:36:12 +0000 (UTC) (envelope-from phk@phk.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id 3E6DB8FC0A; Mon, 28 May 2012 06:36:12 +0000 (UTC) Received: from critter.freebsd.dk (critter-phk.freebsd.dk [192.168.48.2]) by phk.freebsd.dk (Postfix) with ESMTP id E37A0139C3; Mon, 28 May 2012 06:36:04 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.5/8.14.5) with ESMTP id q4S6a2i8022721; Mon, 28 May 2012 06:36:03 GMT (envelope-from phk@phk.freebsd.dk) To: Bruce Evans From: "Poul-Henning Kamp" In-Reply-To: Your message of "Mon, 28 May 2012 04:00:59 +1000." <20120528023818.F2417@besplex.bde.org> Content-Type: text/plain; charset=ISO-8859-1 Date: Mon, 28 May 2012 06:36:02 +0000 Message-ID: <22720.1338186962@critter.freebsd.dk> Cc: Konstantin Belousov , freebsd-arch@FreeBSD.org, Robert Millan Subject: Re: headers that use "struct bintime" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 May 2012 06:36:12 -0000 In message <20120528023818.F2417@besplex.bde.org>, Bruce Evans writes: >devstat has silly APIs, with long double values (despite needing the >range and precision of long doubles over doubles less than most things) >and bintimes (despite needing the precision of bintimes over less than >most things), but is well established so is hard to fix now. I take it you meant to write: devstat wisely chose sufficiently powerful data types to cover a significant of stretch of future, rather than fall in the all too common trap of skimping on data types for the sake of a few bytes ? >Headers with softcs in them >never belonged in , and softcs never belonged in public APIs, Well, that's pretty much how ioctls worked on very early versions of Unix, but I agree with you that it was architecturally wrong. >I wonder why scsi is still doing so much with devstat. For disks, >devstat handling mostly moved into geom. Mainly because SCSI also does tapes, and non-GEOM operations, such as formatting, on disks. I should have severed that link when I did GEOM, splitting devicestat into geomstat and camstat. Doing so now would still be a good idea. >I think libdevstat >just uses a sysctl that converts kernel struct devicestats into >userland struct devstat, so userland should never see the former. It mmaps /dev/devstat, so it is slightly more tangled. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Mon May 28 17:11:27 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id B6B29106566C; Mon, 28 May 2012 17:11:27 +0000 (UTC) (envelope-from freebsd-listen@fabiankeil.de) Received: from smtprelay06.ispgateway.de (smtprelay06.ispgateway.de [80.67.31.101]) by mx1.freebsd.org (Postfix) with ESMTP id 744698FC0A; Mon, 28 May 2012 17:11:27 +0000 (UTC) Received: from [78.35.185.129] (helo=fabiankeil.de) by smtprelay06.ispgateway.de with esmtpsa (TLSv1:AES128-SHA:128) (Exim 4.68) (envelope-from ) id 1SZ3OF-0001HQ-Mj; Mon, 28 May 2012 19:05:55 +0200 Date: Mon, 28 May 2012 19:03:00 +0200 From: Fabian Keil To: gnn@freebsd.org Message-ID: <20120528190300.3a43fc8d@fabiankeil.de> In-Reply-To: <86wr40tfhf.wl%gnn@neville-neil.com> References: <86wr40tfhf.wl%gnn@neville-neil.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=PGP-SHA1; boundary="Sig_/JwiCtaM=HFd/B+UCSuQaqjH"; protocol="application/pgp-signature" X-Df-Sender: Nzc1MDY3 Cc: arch@freebsd.org Subject: Re: RFC: A trial io provider for DTrace... X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 May 2012 17:11:27 -0000 --Sig_/JwiCtaM=HFd/B+UCSuQaqjH Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable gnn@freebsd.org wrote: > I have just put up the first patch that can give you something similar > to the io provider in DTrace. The patch is against HEAD of about a > week ago. >=20 > You can find the patch here: freebsd.org: >=20 > http://people.freebsd.org/~gnn/dtio_provider.diff >=20 > Note that you need to create a src/sys/modules/dtrace/dtio/ directory > for this patch, since patch doesn't seem to create directories for me. Worked for me when applying with -p0. > The arguments are not exactly the same as in Solaris, for instance I > don't yet support the fileinfo_t, but, you can get to the devstat and > bio structures via args[0] and args[1] respectively. >=20 > Here is an example of it working: >=20 > dtrace -n 'io:::start /args[0] !=3D 0/{ trace(args[0]->bio_bcount)}' >=20 > Remember you need to be root to use DTrace. Do you intent to eventually commit your patch to get dtrace working with sudo? I've been using it since you posted it last October and haven't seen any issues. http://lists.freebsd.org/pipermail/freebsd-current/2011-October/028120.html > I need to clean this up and get the translators working properly > before I can check this in. >=20 > Also, note that this patch doesn't catch all I/O, but should get most > of it, as it's hooked into the devstat system. >=20 > I will be adding manual pages for the internals of DTrace to our > section 9, as well as, hopefully, writing up a wiki page on how to add > your own kernel providers. >=20 > Comments welcome. I got: clang -c -O2 -pipe -fno-strict-aliasing -std=3Dc99 -g -Wall -Wredundant-de= cls -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-ar= ith -Winline -Wcast-qual /usr/src/sys/kern/subr_devstat.c:390:2: error: use of undeclared identi= fier 'bs' DTRACE_DEVSTAT_BIO_DONE(); ^ /usr/src/sys/kern/subr_devstat.c:76:41: note: expanded from macro 'DTRA= CE_DEVSTAT_BIO_DONE' (*dtrace_io_done_probe)(dtio_done_id, bs, ds); ^ 1 error generated. *** Error code 1 Stop in /usr/obj/usr/src/sys/ZOEY. *** Error code 1 Stop in /usr/src. *** [buildkernel] Error code 1 and used the following patch to get it to compile: diff --git a/sys/kern/subr_devstat.c b/sys/kern/subr_devstat.c index e2b6d21..732bf9c 100644 --- a/sys/kern/subr_devstat.c +++ b/sys/kern/subr_devstat.c @@ -73,7 +73,7 @@ uint32_t dtio_wait_done_id; #define DTRACE_DEVSTAT_BIO_DONE() \ if (dtrace_io_done_probe !=3D NULL) \ - (*dtrace_io_done_probe)(dtio_done_id, bs, ds); + (*dtrace_io_done_probe)(dtio_done_id, bp, ds); Other than that the provider seems to work fine so far. Thanks a lot. Fabian --Sig_/JwiCtaM=HFd/B+UCSuQaqjH Content-Type: application/pgp-signature; name=signature.asc Content-Disposition: attachment; filename=signature.asc -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAk/Dr8sACgkQBYqIVf93VJ0JFgCfTDujAxJajr4079QCmroHgWsl wDcAn12ctyO6y8hGFhLA1RwxzoB4TNNj =Jfaw -----END PGP SIGNATURE----- --Sig_/JwiCtaM=HFd/B+UCSuQaqjH-- From owner-freebsd-arch@FreeBSD.ORG Mon May 28 19:56:02 2012 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4C4111065674; Mon, 28 May 2012 19:56:02 +0000 (UTC) (envelope-from dim@FreeBSD.org) Received: from tensor.andric.com (cl-327.ede-01.nl.sixxs.net [IPv6:2001:7b8:2ff:146::2]) by mx1.freebsd.org (Postfix) with ESMTP id 095118FC0C; Mon, 28 May 2012 19:55:58 +0000 (UTC) Received: from [IPv6:2001:7b8:3a7:0:4c1c:92fb:538c:83ed] (unknown [IPv6:2001:7b8:3a7:0:4c1c:92fb:538c:83ed]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by tensor.andric.com (Postfix) with ESMTPSA id 2E7C65C59; Mon, 28 May 2012 21:55:58 +0200 (CEST) Message-ID: <4FC3D84B.2060302@FreeBSD.org> Date: Mon, 28 May 2012 21:55:55 +0200 From: Dimitry Andric Organization: The FreeBSD Project User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20120522 Thunderbird/13.0 MIME-Version: 1.0 To: Baptiste Daroussin References: <20120526235510.GB90668@ithaqua.etoilebsd.net> In-Reply-To: <20120526235510.GB90668@ithaqua.etoilebsd.net> X-Enigmail-Version: 1.5a1pre Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: arch@FreeBSD.org Subject: Re: switch tounconditionnal boostrapping while to build the tree X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 May 2012 19:56:02 -0000 On 2012-05-27 01:55, Baptiste Daroussin wrote: > After I replace yacc(1) by byacc(1) on current, we discovered than now it is > impossible to build 9 on current, because byacc(1) is not 100% backward > compatible with our yacc(1). this is because building a boostrap yacc(1) is > conditionned on the version of the host that is building world. > > Looking at Makefile.inc1 I can see that lots of tools are conditionned like > this. I think if we want to go to be able to cross build the tree (I remember > from EuroBSDcon that this is something we want to do) then we need to remove the > conditions and always boostrap any tool necessary to be able to build the tree. > > so if no one care I'll remove the condition to boostrap at least yacc(1) and > lex(1) on current, 9, 8 and 7. > > Would be great imho to do the same for any tools needed by the build system. It could prevent a lot of subtle (and not to subtle :) problems, but it will also waste a lot of CPU time and energy building stuff that isn't strictly needed. (I'm saying this with tongue in cheek, since I'm responsible for a lot of CPU wastage, a.k.a. clang... ;) E.g., the bootstrapping version check mechanism which is now in place, is really a build time optimization, comparable to running builds with NO_CLEAN: you can shoot yourself in the foot, it's dirty, but it works most of the time, and it is *much* faster. I really would not want to throw all that away. But as a compromise, you could add an option to do "brute force bootstrapping", which ignores all version checking, and just builds all required bootstrap tools. The question is also what your end goal is: do you want to reach a NetBSD style approach (basically bootstrap *everything*), or just make the current implementation more robust? From owner-freebsd-arch@FreeBSD.ORG Mon May 28 20:07:09 2012 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 88825106566B; Mon, 28 May 2012 20:07:09 +0000 (UTC) (envelope-from des@des.no) Received: from smtp.des.no (smtp.des.no [194.63.250.102]) by mx1.freebsd.org (Postfix) with ESMTP id 4A6D78FC14; Mon, 28 May 2012 20:07:09 +0000 (UTC) Received: from ds4.des.no (smtp.des.no [194.63.250.102]) by smtp.des.no (Postfix) with ESMTP id 445016171; Mon, 28 May 2012 20:07:08 +0000 (UTC) Received: by ds4.des.no (Postfix, from userid 1001) id 08F7B8DB0; Mon, 28 May 2012 22:07:07 +0200 (CEST) From: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= To: Baptiste Daroussin References: <20120526235510.GB90668@ithaqua.etoilebsd.net> Date: Mon, 28 May 2012 22:07:07 +0200 In-Reply-To: <20120526235510.GB90668@ithaqua.etoilebsd.net> (Baptiste Daroussin's message of "Sun, 27 May 2012 01:55:10 +0200") Message-ID: <86aa0s83ys.fsf@ds4.des.no> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Cc: arch@FreeBSD.org Subject: Re: switch tounconditionnal boostrapping while to build the tree X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 May 2012 20:07:09 -0000 Baptiste Daroussin writes: > so if no one care I'll remove the condition to boostrap at least > yacc(1) and lex(1) on current, 9, 8 and 7. I should interject that I've already added code to Makefile.inc1 in 7, 8 and 9 so yacc is a bootstrap tool when building on a system that has byacc, so right now Baptiste's proposed change won't make much difference, but we should definitely give some thought to what we consider a bootstrap tool and when we should build them. Blindly removing all conditionals is *not* an option, though, as some of the bootstrap tools take ages to build. Remember that all bootstrap tools and build tools are built twice, once during the bootstrap / toolchain phase and once during the everything phase. DES --=20 Dag-Erling Sm=C3=B8rgrav - des@des.no From owner-freebsd-arch@FreeBSD.ORG Wed May 30 12:12:03 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 73805106566B; Wed, 30 May 2012 12:12:03 +0000 (UTC) (envelope-from johnandsara2@cox.net) Received: from eastrmfepo203.cox.net (eastrmfepo203.cox.net [68.230.241.218]) by mx1.freebsd.org (Postfix) with ESMTP id EDD938FC0C; Wed, 30 May 2012 12:12:02 +0000 (UTC) Received: from eastrmimpo110.cox.net ([68.230.241.223]) by eastrmfepo203.cox.net (InterMail vM.8.01.04.00 201-2260-137-20101110) with ESMTP id <20120530121202.GPGW18532.eastrmfepo203.cox.net@eastrmimpo110.cox.net>; Wed, 30 May 2012 08:12:02 -0400 Received: from [192.168.3.22] ([70.177.172.35]) by eastrmimpo110.cox.net with bizsmtp id GCC11j00Q0mAvba02CC288; Wed, 30 May 2012 08:12:02 -0400 X-CT-Class: Clean X-CT-Score: 0.00 X-CT-RefID: str=0001.0A020208.4FC60E92.008B,ss=1,re=0.000,fgs=0 X-CT-Spam: 0 X-Authority-Analysis: v=1.1 cv=s1i2RV+unmn3sLkEA3lf1Tj2LikDbZyRf9iEFo2x6J8= c=1 sm=1 a=f5xKl4ys9bwA:10 a=_shUJCvoDt8A:10 a=G8Uczd0VNMoA:10 a=Wajolswj7cQA:10 a=8nJEP1OIZ-IA:10 a=alU6Bxxa4qBWIf+k8j/ISQ==:17 a=HzI0Pm0Nd4Mf1Gaf9u8A:9 a=wPNLvfGTeEIA:10 a=alU6Bxxa4qBWIf+k8j/ISQ==:117 X-CM-Score: 0.00 Authentication-Results: cox.net; none Message-ID: <4FC60E8C.1070204@cox.net> Date: Wed, 30 May 2012 08:11:56 -0400 From: "John D. Hendrickson and Sara Darnell" User-Agent: Thunderbird 2.0.0.24 (X11/20100228) MIME-Version: 1.0 To: Baptiste Daroussin References: <20120526235510.GB90668@ithaqua.etoilebsd.net> In-Reply-To: <20120526235510.GB90668@ithaqua.etoilebsd.net> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: arch@FreeBSD.org Subject: Re: switch tounconditionnal boostrapping while to build the tree X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: johnandsara2@cox.net List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 May 2012 12:12:03 -0000 i find the statements hard to believe why are you doing it that way ? (using a broken yacc) why do you believe it's ok to change previous releases that don't use that yacc ? Baptiste Daroussin wrote: > Hi > > After I replace yacc(1) by byacc(1) on current, we discovered than now it is > impossible to build 9 on current, because byacc(1) is not 100% backward > compatible with our yacc(1). this is because building a boostrap yacc(1) is > conditionned on the version of the host that is building world. > > Looking at Makefile.inc1 I can see that lots of tools are conditionned like > this. I think if we want to go to be able to cross build the tree (I remember > from EuroBSDcon that this is something we want to do) then we need to remove the > conditions and always boostrap any tool necessary to be able to build the tree. > > so if no one care I'll remove the condition to boostrap at least yacc(1) and > lex(1) on current, 9, 8 and 7. > > Would be great imho to do the same for any tools needed by the build system. > > regards, > Bapt From owner-freebsd-arch@FreeBSD.ORG Wed May 30 13:02:45 2012 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 27043106566C for ; Wed, 30 May 2012 13:02:45 +0000 (UTC) (envelope-from bapt@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 0B17F8FC25; Wed, 30 May 2012 13:02:45 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.5/8.14.5) with ESMTP id q4UD2id6041053; Wed, 30 May 2012 13:02:44 GMT (envelope-from bapt@FreeBSD.org) Received: (from bapt@localhost) by freefall.freebsd.org (8.14.5/8.14.5/Submit) id q4UD2iQc041037; Wed, 30 May 2012 13:02:44 GMT (envelope-from bapt@FreeBSD.org) X-Authentication-Warning: freefall.freebsd.org: bapt set sender to bapt@FreeBSD.org using -f Date: Wed, 30 May 2012 15:02:41 +0200 From: Baptiste Daroussin To: "John D. Hendrickson and Sara Darnell" Message-ID: <20120530130241.GH9952@ithaqua.etoilebsd.net> References: <20120526235510.GB90668@ithaqua.etoilebsd.net> <4FC60E8C.1070204@cox.net> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="+Z7/5fzWRHDJ0o7Q" Content-Disposition: inline In-Reply-To: <4FC60E8C.1070204@cox.net> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: arch@FreeBSD.org Subject: Re: switch tounconditionnal boostrapping while to build the tree X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 May 2012 13:02:45 -0000 --+Z7/5fzWRHDJ0o7Q Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, May 30, 2012 at 08:11:56AM -0400, John D. Hendrickson and Sara Darn= ell wrote: > i find the statements hard to believe >=20 > why are you doing it that way ? (using a broken yacc) It is not a broken yacc. the yacc import just revealed another problem whic= h is boostrap tools may needs to be always boostraped (which makes sense if you really want to support cross-compilation from nearly anywhere. > why do you believe it's ok to change previous releases that don't use tha= t yacc ? To build able to build them on a system that do not have the yacc version t= hey need, that system could be linux for example or it could be a recent head regards, Bapt --+Z7/5fzWRHDJ0o7Q Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAk/GGnEACgkQ8kTtMUmk6EwUGQCgldam/3Nt535vMIi9DcHoGy9S q58An0ztuebkuqaVYDn0bTZ19KoPXiFu =ZpeD -----END PGP SIGNATURE----- --+Z7/5fzWRHDJ0o7Q-- From owner-freebsd-arch@FreeBSD.ORG Wed May 30 22:01:31 2012 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C98D7106564A; Wed, 30 May 2012 22:01:31 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id 7DDC18FC12; Wed, 30 May 2012 22:01:31 +0000 (UTC) Received: from [10.30.101.53] ([209.117.142.2]) (authenticated bits=0) by harmony.bsdimp.com (8.14.4/8.14.3) with ESMTP id q4ULmXRa048852 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES128-SHA bits=128 verify=NO); Wed, 30 May 2012 15:48:35 -0600 (MDT) (envelope-from imp@bsdimp.com) Mime-Version: 1.0 (Apple Message framework v1084) Content-Type: text/plain; charset=us-ascii From: Warner Losh In-Reply-To: <20120530130241.GH9952@ithaqua.etoilebsd.net> Date: Wed, 30 May 2012 15:48:28 -0600 Content-Transfer-Encoding: quoted-printable Message-Id: <722ECB48-6C82-4FF0-AC18-02910DBD0B66@bsdimp.com> References: <20120526235510.GB90668@ithaqua.etoilebsd.net> <4FC60E8C.1070204@cox.net> <20120530130241.GH9952@ithaqua.etoilebsd.net> To: Baptiste Daroussin X-Mailer: Apple Mail (2.1084) X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (harmony.bsdimp.com [10.0.0.6]); Wed, 30 May 2012 15:48:35 -0600 (MDT) Cc: arch@FreeBSD.org, "John D. Hendrickson and Sara Darnell" Subject: Re: switch tounconditionnal boostrapping while to build the tree X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 May 2012 22:01:31 -0000 On May 30, 2012, at 7:02 AM, Baptiste Daroussin wrote: > On Wed, May 30, 2012 at 08:11:56AM -0400, John D. Hendrickson and Sara = Darnell wrote: >> i find the statements hard to believe >>=20 >> why are you doing it that way ? (using a broken yacc) > It is not a broken yacc. the yacc import just revealed another problem = which is > boostrap tools may needs to be always boostraped (which makes sense if = you > really want to support cross-compilation from nearly anywhere. Cross build support doesn't require that you break things like that. = Never had, and never will. The FreeBSD version is irrelevant to cross = building, so bootstrapping checks are still needed. In the cross build, = the bootstrapping OS version will be 0 and we'll build everything we = need (possibly more than we would bootstrapping from supported FreeBSD = versions). >> why do you believe it's ok to change previous releases that don't use = that yacc ? >=20 > To build able to build them on a system that do not have the yacc = version they > need, that system could be linux for example or it could be a recent = head. You can accomplish this without blowing away the conditionals. Warner From owner-freebsd-arch@FreeBSD.ORG Thu May 31 05:29:19 2012 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 1419D106566B for ; Thu, 31 May 2012 05:29:19 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id AEE158FC16 for ; Thu, 31 May 2012 05:29:18 +0000 (UTC) Received: from 63.imp.bsdimp.com (63.imp.bsdimp.com [10.0.0.63]) (authenticated bits=0) by harmony.bsdimp.com (8.14.4/8.14.3) with ESMTP id q4V5RImQ051940 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES128-SHA bits=128 verify=NO) for ; Wed, 30 May 2012 23:27:18 -0600 (MDT) (envelope-from imp@bsdimp.com) From: Warner Losh Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Date: Wed, 30 May 2012 23:27:18 -0600 Message-Id: To: arch@FreeBSD.org Mime-Version: 1.0 (Apple Message framework v1084) X-Mailer: Apple Mail (2.1084) X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (harmony.bsdimp.com [10.0.0.6]); Wed, 30 May 2012 23:27:18 -0600 (MDT) Cc: Subject: rman_await_resource X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 May 2012 05:29:19 -0000 Is anybody using rman_await_resource? I can see no in-tree users, and = the code's locking looks dubious. I'd like to just delete it. Warner From owner-freebsd-arch@FreeBSD.ORG Thu May 31 12:24:25 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9B2A31065680; Thu, 31 May 2012 12:24:25 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 6AD5C8FC0A; Thu, 31 May 2012 12:24:24 +0000 (UTC) Received: from odyssey.starpoint.kiev.ua (alpha-e.starpoint.kiev.ua [212.40.38.101]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id PAA01061; Thu, 31 May 2012 15:23:57 +0300 (EEST) (envelope-from avg@FreeBSD.org) Message-ID: <4FC762DD.90101@FreeBSD.org> Date: Thu, 31 May 2012 15:23:57 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:12.0) Gecko/20120503 Thunderbird/12.0.1 MIME-Version: 1.0 To: Christoph Hellwig , d@delphij.net, freebsd-arch@FreeBSD.org References: <4FAC3EAB.6050303@delphij.net> <861umkurt8.fsf@ds4.des.no> <20120517055425.GA802@infradead.org> In-Reply-To: <20120517055425.GA802@infradead.org> X-Enigmail-Version: 1.5pre Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: Eitan Adler , Adrian Chadd , =?ISO-8859-1?Q?Dag-Erling_Sm=F8?=, =?ISO-8859-1?Q?rgrav?= Subject: Re: Allow small amount of memory be mlock()'ed by unprivileged process? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 May 2012 12:24:25 -0000 on 17/05/2012 08:54 Christoph Hellwig said the following: > Linux has added a RLIMIT_MEMLOCK opcode for setrlimit that allows > controlling the amount of memory users can lock down, with a default > of a single page for unprivilegued processes. In fact, FreeBSD also has this rlimit and there seems to be full support for it on both user and kernel sides. OTOH, PRIV_VM_MLOCK privilege seems to be granted only to the super-user in the default configuration. And this privilege kind of defeats the limit. Perhaps, we should/could kill the privilege and set the limit to a sufficiently small/safe value for ordinary users? P.S. Some MAC code has this comment: /* * Allow VM privileges; it would be nice if these were subject to * resource limits. */ case PRIV_VM_MADV_PROTECT: case PRIV_VM_MLOCK: In the case of PRIV_VM_MLOCK it would be nice if one hand knew what the other is doing :-) P.P.S. I would really like to see RLIMIT_NICE and RLIMIT_RTPRIO in FreeBSD. -- Andriy Gapon From owner-freebsd-arch@FreeBSD.ORG Thu May 31 16:23:27 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4FC86106564A for ; Thu, 31 May 2012 16:23:27 +0000 (UTC) (envelope-from gnn@freebsd.org) Received: from vps.hungerhost.com (vps.hungerhost.com [216.38.53.176]) by mx1.freebsd.org (Postfix) with ESMTP id 117A78FC0C for ; Thu, 31 May 2012 16:23:27 +0000 (UTC) Received: from [209.249.190.124] (port=55088 helo=[10.2.212.229]) by vps.hungerhost.com with esmtpsa (TLSv1:AES128-SHA:128) (Exim 4.77) (envelope-from ) id 1Sa89g-0004dH-F6; Thu, 31 May 2012 12:23:23 -0400 Mime-Version: 1.0 (Apple Message framework v1278) Content-Type: text/plain; charset=us-ascii From: George Neville-Neil In-Reply-To: <20120528190300.3a43fc8d@fabiankeil.de> Date: Thu, 31 May 2012 12:23:22 -0400 Content-Transfer-Encoding: quoted-printable Message-Id: References: <86wr40tfhf.wl%gnn@neville-neil.com> <20120528190300.3a43fc8d@fabiankeil.de> To: Fabian Keil X-Mailer: Apple Mail (2.1278) X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - vps.hungerhost.com X-AntiAbuse: Original Domain - freebsd.org X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12] X-AntiAbuse: Sender Address Domain - freebsd.org Cc: arch@freebsd.org Subject: Re: RFC: A trial io provider for DTrace... X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 May 2012 16:23:27 -0000 On May 28, 2012, at 13:03 , Fabian Keil wrote: > Worked for me when applying with -p0. >=20 Great! >> The arguments are not exactly the same as in Solaris, for instance I >> don't yet support the fileinfo_t, but, you can get to the devstat and >> bio structures via args[0] and args[1] respectively. >>=20 >> Here is an example of it working: >>=20 >> dtrace -n 'io:::start /args[0] !=3D 0/{ trace(args[0]->bio_bcount)}' >>=20 >> Remember you need to be root to use DTrace. >=20 > Do you intent to eventually commit your patch to get dtrace working > with sudo? I've been using it since you posted it last October and > haven't seen any issues. > = http://lists.freebsd.org/pipermail/freebsd-current/2011-October/028120.htm= l >=20 Sorry, what I meant was that you needed root privilege to run DTrace, sudo will give you that. > I got: >=20 > clang -c -O2 -pipe -fno-strict-aliasing -std=3Dc99 -g -Wall = -Wredundant-decls -Wnested-externs -Wstrict-prototypes = -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual > /usr/src/sys/kern/subr_devstat.c:390:2: error: use of undeclared = identifier 'bs' > DTRACE_DEVSTAT_BIO_DONE(); > ^ > /usr/src/sys/kern/subr_devstat.c:76:41: note: expanded from macro = 'DTRACE_DEVSTAT_BIO_DONE' > (*dtrace_io_done_probe)(dtio_done_id, bs, ds); > ^ > 1 error generated. > *** Error code 1 >=20 > Stop in /usr/obj/usr/src/sys/ZOEY. > *** Error code 1 >=20 > Stop in /usr/src. > *** [buildkernel] Error code 1 >=20 > and used the following patch to get it to compile: >=20 > diff --git a/sys/kern/subr_devstat.c b/sys/kern/subr_devstat.c > index e2b6d21..732bf9c 100644 > --- a/sys/kern/subr_devstat.c > +++ b/sys/kern/subr_devstat.c > @@ -73,7 +73,7 @@ uint32_t dtio_wait_done_id; >=20 > #define DTRACE_DEVSTAT_BIO_DONE() \ > if (dtrace_io_done_probe !=3D NULL) \ > - (*dtrace_io_done_probe)(dtio_done_id, bs, ds); > + (*dtrace_io_done_probe)(dtio_done_id, bp, ds); >=20 > Other than that the provider seems to work fine so far. OK, let me get that fixed up and put up a newer patch. Thanks for testing and replying! Best, George From owner-freebsd-arch@FreeBSD.ORG Thu May 31 18:08:47 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 93901106564A for ; Thu, 31 May 2012 18:08:47 +0000 (UTC) (envelope-from gnn@neville-neil.com) Received: from vps.hungerhost.com (vps.hungerhost.com [216.38.53.176]) by mx1.freebsd.org (Postfix) with ESMTP id 61FCD8FC14 for ; Thu, 31 May 2012 18:08:47 +0000 (UTC) Received: from [209.249.190.124] (port=56608 helo=[10.2.212.229]) by vps.hungerhost.com with esmtpsa (TLSv1:AES128-SHA:128) (Exim 4.77) (envelope-from ) id 1Sa9ni-0006Aa-8q; Thu, 31 May 2012 14:08:46 -0400 Mime-Version: 1.0 (Apple Message framework v1278) Content-Type: text/plain; charset=us-ascii From: George Neville-Neil In-Reply-To: Date: Thu, 31 May 2012 14:08:47 -0400 Content-Transfer-Encoding: 7bit Message-Id: <738E93BC-3BEE-4792-9249-C2233EE8D7C6@neville-neil.com> References: <86wr40tfhf.wl%gnn@neville-neil.com> <20120528190300.3a43fc8d@fabiankeil.de> To: Fabian Keil X-Mailer: Apple Mail (2.1278) X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - vps.hungerhost.com X-AntiAbuse: Original Domain - freebsd.org X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12] X-AntiAbuse: Sender Address Domain - neville-neil.com Cc: arch@freebsd.org Subject: Re: RFC: A trial io provider for DTrace... X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 May 2012 18:08:47 -0000 OK, the latest patch, with Fabien's fix, is up at: http://people.freebsd.org/~gnn/dtio_provider_2.diff Best, George From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 01:46:10 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx2.freebsd.org (mx2.freebsd.org [69.147.83.53]) by hub.freebsd.org (Postfix) with ESMTP id A0C471065670; Fri, 1 Jun 2012 01:46:10 +0000 (UTC) (envelope-from dougb@FreeBSD.org) Received: from [127.0.0.1] (hub.freebsd.org [IPv6:2001:4f8:fff6::36]) by mx2.freebsd.org (Postfix) with ESMTP id DFD4CB2CE4; Fri, 1 Jun 2012 01:40:43 +0000 (UTC) Message-ID: <4FC81D9C.2080801@FreeBSD.org> Date: Thu, 31 May 2012 18:40:44 -0700 From: Doug Barton Organization: http://www.FreeBSD.org/ User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20120428 Thunderbird/12.0.1 MIME-Version: 1.0 To: Andriy Gapon References: <4FAC3EAB.6050303@delphij.net> <861umkurt8.fsf@ds4.des.no> <20120517055425.GA802@infradead.org> <4FC762DD.90101@FreeBSD.org> In-Reply-To: <4FC762DD.90101@FreeBSD.org> X-Enigmail-Version: 1.4.1 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: =?ISO-8859-1?Q?Dag-Erling_Sm=F8?=@FreeBSD.ORG, Adrian Chadd , d@delphij.net, Eitan Adler , freebsd-arch@FreeBSD.org, rgrav Subject: Re: Allow small amount of memory be mlock()'ed by unprivileged process? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2012 01:46:10 -0000 On 5/31/2012 5:23 AM, Andriy Gapon wrote: > In fact, FreeBSD also has this rlimit and there seems to be full support for it on > both user and kernel sides. > OTOH, PRIV_VM_MLOCK privilege seems to be granted only to the super-user in the > default configuration. And this privilege kind of defeats the limit. > > Perhaps, we should/could kill the privilege and set the limit to a sufficiently > small/safe value for ordinary users? I like this idea, but someone else in the thread (sorry, don't have it handy) brought up the point that we don't want the aggregate of per-user limits to be able to bring down the system either. So the right solution would seem to be a reasonable per-user limit, and a cap on the maximum total amount of locked pages for all unprivileged users, probably based on some percentage of total available memory? Doug -- This .signature sanitized for your protection From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 15:41:22 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5BE221065679 for ; Fri, 1 Jun 2012 15:41:22 +0000 (UTC) (envelope-from bryan@shatow.net) Received: from secure.xzibition.com (secure.xzibition.com [173.160.118.92]) by mx1.freebsd.org (Postfix) with ESMTP id A98AE8FC12 for ; Fri, 1 Jun 2012 15:41:21 +0000 (UTC) DomainKey-Signature: a=rsa-sha1; c=nofws; d=shatow.net; h=message-id :date:from:mime-version:to:cc:subject:references:in-reply-to :content-type; q=dns; s=sweb; b=OQz+T/sQiaxmoD9DZU2K7Wgxou2gtSX9 6EPyS2MoR2nEHhjfLHYEtTyS9XFb8LDzPHCkI/1obWkDdGa2M7NanfWm42P9A+xF 0EMgTbh6M2YYtG6vuRwNEIoFyuVVuGNGv2fGI2ygBIjeou0yC4QDaiS/oNivG7NK FbPzeHOXuEI= DKIM-Signature: v=1; a=rsa-sha256; c=simple; d=shatow.net; h=message-id :date:from:mime-version:to:cc:subject:references:in-reply-to :content-type; s=sweb; bh=kBrTc/W9qIgsyOKMzK/PVWXkHzwP4HWcGp7Xnh UzNmk=; b=swaLiDY9Fyw/S2EihR7Km6i3EJDI9uFyI+o7OkIYu6w6SJ8ozOb7gS 6nb4n+W3UZVdYzp7CcfDQ7sl3fSOYynu9aLPkiH4W8NPhvuOqa/fPrpuvIe7V8oe F8mjRPRZu6g9rS5Zn3wIlsWJNFi0ijBz7NxnnjZFVsfL9OnDK4JTk= Received: (qmail 3984 invoked from network); 1 Jun 2012 10:41:17 -0500 Received: from unknown (HELO ?192.168.21.109?) (bryan@shatow.net@74.94.87.209) by sweb.xzibition.com with ESMTPA; 1 Jun 2012 10:41:17 -0500 Message-ID: <4FC8E29F.2010806@shatow.net> Date: Fri, 01 Jun 2012 10:41:19 -0500 From: Bryan Drewery User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20120428 Thunderbird/12.0.1 MIME-Version: 1.0 To: Doug Barton References: <4FAC3EAB.6050303@delphij.net> <861umkurt8.fsf@ds4.des.no> <20120517055425.GA802@infradead.org> <4FC762DD.90101@FreeBSD.org> <4FC81D9C.2080801@FreeBSD.org> In-Reply-To: <4FC81D9C.2080801@FreeBSD.org> X-Enigmail-Version: 1.4.1 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enigB651918E900EB354EF176708" Cc: =?ISO-8859-1?Q?Dag-Erling_Sm=F8?=@FreeBSD.ORG, Adrian Chadd , d@delphij.net, Andriy Gapon , Eitan Adler , freebsd-arch@FreeBSD.org, rgrav Subject: Re: Allow small amount of memory be mlock()'ed by unprivileged process? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2012 15:41:22 -0000 This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enigB651918E900EB354EF176708 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable On 5/31/2012 8:40 PM, Doug Barton wrote: > On 5/31/2012 5:23 AM, Andriy Gapon wrote: >> In fact, FreeBSD also has this rlimit and there seems to be full suppo= rt for it on >> both user and kernel sides. >> OTOH, PRIV_VM_MLOCK privilege seems to be granted only to the super-us= er in the >> default configuration. And this privilege kind of defeats the limit. >> >> Perhaps, we should/could kill the privilege and set the limit to a suf= ficiently >> small/safe value for ordinary users? >=20 > I like this idea, but someone else in the thread (sorry, don't have it > handy) brought up the point that we don't want the aggregate of per-use= r > limits to be able to bring down the system either. So the right solutio= n > would seem to be a reasonable per-user limit, and a cap on the maximum > total amount of locked pages for all unprivileged users, probably based= > on some percentage of total available memory? >=20 > Doug >=20 I like this approach. A per-user ulimit, and a global max sysctl that can be overridden, but by default based on a percentage of available memo= ry. --=20 Regards, Bryan Drewery --------------enigB651918E900EB354EF176708 Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQIcBAEBAgAGBQJPyOKjAAoJEG54KsA8mwz5DBcP/2Z14YhYXnsnl2h3yAIcrB04 89cEVNWqeSaRhRrenGkTDI3qhpzd19D/huugd50YT9L+HJUehmBbL8kL0a6tc0KF 8COlXldFOWL1v3TmXgbkirE9+eEp1AoGh/f/SiKDBLPwufLMOO/NMvElPSgkofV1 sVFy56824PELgaK0aUeqNYSM+VzCGlgetVCJuyBSs6TguBIp21A9/W+UIfRb3ZLI mdVIjhZyzHMzFz8PbdSkVv7PMoCW/hEhHELDZTgiVShX7UjbE7rTmOQoOILPgv/B xPgUv6FdSD3OkRBy1v0TXunnj8ztdolEU0rpkBQASFI0meoYcAnh9ixvLZESK9Rt remsIzaynZOqnOfATuPT9ukehf52Yz1O2qTH148H9Ija9+V0gI0n0SpXnu4RHQ92 fCwGHGNq0yw1LmvzA1qWPRRXc+RcVERowPLA0ILCwCwtUFBUnymy4qdZsmJyNLZ7 SpB5DMTM6vB9eiUrOGdFUfh/xqQDNcMJcPuWlUTHrzHADkKe+Qch4QhIg7q5shBK 46a5BT4IFeEqjNZuNZm/jfF7FsIPcCweerwHpM46d12COj2iglMgy/BFuuBmVjSJ jtfltEZI3FmCfIZOWzZfbnDhreVdE+ATESD49PKOyINDv7K2UvMfrg5O7ywSY61n m5goeGBBQ7E5suuiniGj =jY8i -----END PGP SIGNATURE----- --------------enigB651918E900EB354EF176708-- From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 17:53:17 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9E818106566B; Fri, 1 Jun 2012 17:53:17 +0000 (UTC) (envelope-from giovanni.trematerra@gmail.com) Received: from mail-qc0-f182.google.com (mail-qc0-f182.google.com [209.85.216.182]) by mx1.freebsd.org (Postfix) with ESMTP id 03AA58FC15; Fri, 1 Jun 2012 17:53:16 +0000 (UTC) Received: by qcsg15 with SMTP id g15so1553323qcs.13 for ; Fri, 01 Jun 2012 10:53:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:date:message-id:subject:from:to:cc:content-type; bh=8/It4euG+yHm4eSPneqe8eBPwiZ6874h6vDAXdbNymI=; b=t2vR3lmjLV37of+Db1iJPqKztI50KWnG/QRrjd2HodnbJfz+dwqvDVa2Mlews3J8L+ wgWfFeJqGOB/2mjX/e+HVydi429ZkjzXKkmfrt/li1o5sT+5Zp/ylxJTMPANymCtWIVR mASMAbo6Pp8PCOzSbLD9zvaJN35S6nBUPNzjcRpulOC3HxjeJwWikuqND2CCgyJGBu+V 6O6Dp3/WKgli55pvzLu+Tg1W06Wc5mYUzHtnR6PhRG5cO0Ia1NGwFzC068TRJffM9nI+ THnHv5hBPwkL0vANmoe/8rLcqsgpM6lKHIZ+JmKaJ8poE/Y9HCgjoTdeHGpJ3qJB4V4x jxag== MIME-Version: 1.0 Received: by 10.224.184.82 with SMTP id cj18mr3089654qab.81.1338573195923; Fri, 01 Jun 2012 10:53:15 -0700 (PDT) Received: by 10.229.160.20 with HTTP; Fri, 1 Jun 2012 10:53:15 -0700 (PDT) Date: Fri, 1 Jun 2012 19:53:15 +0200 Message-ID: From: Giovanni Trematerra To: freebsd-arch@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 Cc: Attilio Rao , alc@freebsd.org, Konstantin Belousov , Alexander Kabaev Subject: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2012 17:53:17 -0000 Hello, I'd like to discuss a way to provide a mechanism to share some read-only data between kernel and user space programs avoiding syscall overhead, implementing some them, such as gettimeofday(3) and time(3) as ordinary user space routine. The patch at http://www.trematerra.net/patches/ksvar_experimental.patch is in a very experimental stage. It's just a proof-of-concept. Only works for an AMD64 kernel and only for 64-bit applications. The idea is to have all the variables that we want to share between kernel and user space into one or more consecutive pages of memory that will be mapped read-only into every running process. At the start of the first shared page there'll be a table with as many entries as the number of the shared variables. Each entry is a 32-bit value that is the offset between the start of the shared page and the start of the variable in the page. The user space processes need to find out the map address of shared page and use the table to access to the shared variables. Kernel will export a variable to user space as an index, so user space code must refer to a specific index to access a kernel shared variable. Let's take a quick look to the KPI/API for exporting/importing kernel shared variables. Say we want implement a routine to export an int from the kernel. To define the variable to be exported inside the kernel you would use KSVAR_DEFINE(0, int, test_value); You have just defined an int variable named "test_value" at index 0. Inside the kernel you can write/read as usual using the symbol test_value; Now you likely want add to libc a function callable from user processes that return the test_value variable. So first of all you need the import the variable. KSVAR_IMPORT(0, int, test_value); and to obtain a pointer to read the value you would use KSVAR(test_value); so your function would look like something like this int get_test_value() { return (*KSVAR(test_value)); } Then inside your process just call get_test_value() function as you usually do and you'll get a kernel written value without switching in kernel mode. Let's see now in more detail how that could be accomplished. The shared variables will be accessed as normal variables and are read/write inside the kernel. The variables need to be inside the same page(s) and nothing but the shared variables (and the table) must be into the page(s). To obtain that I changed the linker script in this way --- a/sys/conf/ldscript.amd64 +++ b/sys/conf/ldscript.amd64 @@ -177,6 +177,15 @@ SECTIONS *(.ldata .ldata.* .gnu.linkonce.l.*) . = ALIGN(. != 0 ? 64 / 8 : 1); } + .ksvar ALIGN(CONSTANT (COMMONPAGESIZE)) : + { + __ksvar_set_start = .; + *(.ksvar_table) + *(.ksvar) + + . = ALIGN(CONSTANT (COMMONPAGESIZE)); + __ksvar_set_stop = .; + } . = ALIGN(64 / 8); _end = .; PROVIDE (end = .); . = DATA_SEGMENT_END (.); When we want to define a variable in the kernel to share with user space we have to use KSVAR_DEFINE macro in sys/sys/ksvar.h +struct ksvar_set { + uint32_t idx; + char *pksvar; +}; + +/* + * Declare a variable into kernel shared linker_set. + */ +#define KSVAR_DEFINE(index, type, name) \ + static type name __section(".ksvar"); \ + static struct ksvar_set name ## _ksvar_set = { \ + .idx = index, \ + .pksvar = (char *) &name \ + }; \ + DATA_SET(ksvar_set, name ## _ksvar_set) Every variable must have a unique index. The indexes must start from zero and be consecutive. When you add an index you must bump the size of the table (KSVAR_TABLE_SIZE) (see sys/sys/ksvar.h) The variables are inside the kernel static image that isn't managed by the VM and so we need to allocate pages to map the physical addresses. A new SYSINIT (ksvarinit) will allocate a set of vm_page_t through the vm_phys_fictitious_reg_range interface and fill the table using the information of the ksvar_set linker set, then will create a vm_object_t (vm_object_ksvar), mark the fake pages as valid and put them into it. When a new process is created by exec(3) the vm_object_ksvar will be mapped read-only into the process address space by vm_map_fixed routine just before mapping the user stack. The address of mapping will be recorded inside the new p_ksvar field of the struct proc. This field will be exported through a sysctl to the user space processes. In order to implement syscalls as user space routines, we have to find out the mapped address of the kernel shared variables when the libc is mapped into the process. So I added a function marked with the attribute constructor. It will called before any code into user process and before any code inside the libc. +__attribute((constructor)) void init_kernel_shared() +{ + int mib[2]; + size_t len; + vm_offset_t ksvar_address; + + mib[0] = CTL_KERN; + mib[1] = KERN_KSVAR; + len = sizeof(vm_offset_t); + if (__sysctl(mib, 2, (void *) &ksvar_address, &len, NULL, 0) != -1) + ksvar_table = (uint32_t *) ksvar_address; +} Once the libc knows the address of the table it can access to the shared variables. Just as proof of concept I re-implemented gettimeofday(3) in user space. First of all I didn't remove the entry into the syscall.master, just renamed the sys_gettimeofday. I need it for the fallback path. In the kernel I introduced a struct wall_clock. +struct wall_clock +{ + struct timeval tv; + struct timezone tz; +}; The struct is exported through sys/sys/time.h header. I defined a new kernel shared variable. To do so I added an index in sys/sys/ksvar.h WALL_CLOCK_INDEX and bumped KSVAR_TABLE_SIZE to 1. In the sys/kern/kern_clocksource.c +/* kernel shared variable for implmenting gettimeofday. */ +KSVAR_DEFINE(WALL_CLOCK_INDEX, struct wall_clock, wall_clock); Now we defined a shared variable at index WALL_CLOCK_INDEX of type struct wall_clock and named wall_clock. Inside handleevents I update the info exported by wall_clock. + struct timeval tv; + + /* update time for userspace gettimeofday */ + microtime(&tv); + wall_clock.tv = tv; + wall_clock.tz.tz_minuteswest = tz_minuteswest; + wall_clock.tz.tz_dsttime = tz_dsttime; Now, in libc we import the shared variable +KSVAR_IMPORT(WALL_CLOCK_INDEX, struct wall_clock, wall_clock); note that WALL_CLOCK_INDEX must be the same of the one defined inside the kernel, and define a new function gettimeofday +int +gettimeofday(struct timeval *tp, struct timezone *tzp) +{ + + /* fallback to syscall if kernel doesn't export ksvar */ + if (!KSVAR_IS_ACTIVE()) + return (sys_gettimeofday(tp, tzp)); + + if (tp != NULL) + *tp = KSVAR(wall_clock)->tv; + if (tzp != NULL) + *tzp = KSVAR(wall_clock)->tz; + return (0); +} Now when a process will call getimeofday, will call that function actually. If the process makes a lot of call to gettimeofday, we will see a performance boost. Note that if ksvar are not exported from the kernel (KSVAR_IS_ACTIVE), the function fallback to call the actual syscall (sys_gettimeofday). Open tasks - implement support for 32-bit emulated processes running in a 64-bit environment. - extend support to others arch - implement more syscalls - benchmarks - Test, test, test. I'm looking forward to hear about your comments and suggestions. -- Gianni From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 19:22:30 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1DEA3106566B; Fri, 1 Jun 2012 19:22:30 +0000 (UTC) (envelope-from lev@serebryakov.spb.ru) Received: from onlyone.friendlyhosting.spb.ru (onlyone.friendlyhosting.spb.ru [IPv6:2a01:4f8:131:60a2::2]) by mx1.freebsd.org (Postfix) with ESMTP id A761A8FC18; Fri, 1 Jun 2012 19:22:29 +0000 (UTC) Received: from lion.home.serebryakov.spb.ru (unknown [IPv6:2001:470:923f:1:756c:80e7:ffb9:a0c4]) (Authenticated sender: lev@serebryakov.spb.ru) by onlyone.friendlyhosting.spb.ru (Postfix) with ESMTPA id 1E1304AC2D; Fri, 1 Jun 2012 23:22:26 +0400 (MSK) Date: Fri, 1 Jun 2012 23:22:20 +0400 From: Lev Serebryakov X-Priority: 3 (Normal) Message-ID: <681265513.20120601232220@serebryakov.spb.ru> To: Giovanni Trematerra In-Reply-To: References: MIME-Version: 1.0 Content-Type: text/plain; charset=windows-1251 Content-Transfer-Encoding: quoted-printable Cc: Attilio Rao , alc@freebsd.org, Konstantin Belousov , Alexander Kabaev , freebsd-arch@freebsd.org Subject: Re: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2012 19:22:30 -0000 Hello, Giovanni. You wrote 1 =E8=FE=ED=FF 2012 =E3., 21:53:15: GT> I'm looking forward to hear about your comments and suggestions. It is great, that you start this work! This approach was discussed several times according to my memory, as way to make cheap sysclass like gettimeofday() (because some Linux-orientet programs like to call them very often and have them cheap is very good idea), and every time conclusion was, that it is very good approach, but without any resulting code. --=20 // Black Lion AKA Lev Serebryakov From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 19:35:30 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A8B9D1065741; Fri, 1 Jun 2012 19:35:30 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id CAF758FC17; Fri, 1 Jun 2012 19:35:29 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q51JZMlU093011; Fri, 1 Jun 2012 22:35:22 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q51JZMFW056808; Fri, 1 Jun 2012 22:35:22 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q51JZMsi056807; Fri, 1 Jun 2012 22:35:22 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Fri, 1 Jun 2012 22:35:22 +0300 From: Konstantin Belousov To: Giovanni Trematerra Message-ID: <20120601193522.GA2358@deviant.kiev.zoral.com.ua> References: Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="SRK8lRENmpuaYFQC" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: Attilio Rao , alc@freebsd.org, Alexander Kabaev , freebsd-arch@freebsd.org Subject: Re: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2012 19:35:30 -0000 --SRK8lRENmpuaYFQC Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Fri, Jun 01, 2012 at 07:53:15PM +0200, Giovanni Trematerra wrote: > Hello, > I'd like to discuss a way to provide a mechanism to share some read-only > data between kernel and user space programs avoiding syscall overhead, > implementing some them, such as gettimeofday(3) and time(3) as ordinary > user space routine. >=20 > The patch at > http://www.trematerra.net/patches/ksvar_experimental.patch >=20 > is in a very experimental stage. It's just a proof-of-concept. > Only works for an AMD64 kernel and only for 64-bit applications. > The idea is to have all the variables that we want to share between kernel > and user space into one or more consecutive pages of memory that will be > mapped read-only into every running process. At the start of the first > shared page > there'll be a table with as many entries as the number of the shared vari= ables. > Each entry is a 32-bit value that is the offset between the start of the = shared > page and the start of the variable in the page. The user space processes = need > to find out the map address of shared page and use the table to access to= the > shared variables. > Kernel will export a variable to user space as an index, so user space co= de > must refer to a specific index to access a kernel shared variable. > Let's take a quick look to the KPI/API for exporting/importing kernel > shared variables. > Say we want implement a routine to export an int from the kernel. > To define the variable to be exported inside the kernel you would use >=20 > KSVAR_DEFINE(0, int, test_value); >=20 > You have just defined an int variable named "test_value" at index 0. > Inside the kernel you can write/read as usual using the symbol test_value; > Now you likely want add to libc a function callable from user processes > that return the test_value variable. So first of all you need the import = the > variable. >=20 > KSVAR_IMPORT(0, int, test_value); >=20 > and to obtain a pointer to read the value you would use >=20 > KSVAR(test_value); >=20 > so your function would look like something like this >=20 > int get_test_value() > { >=20 > return (*KSVAR(test_value)); > } >=20 > Then inside your process just call get_test_value() function as you usual= ly > do and you'll get a kernel written value without switching in kernel mode. >=20 > Let's see now in more detail how that could be accomplished. > The shared variables will be accessed as normal variables and are read/wr= ite > inside the kernel. The variables need to be inside the same page(s) and n= othing > but the shared variables (and the table) must be into the page(s). To > obtain that > I changed the linker script in this way >=20 > --- a/sys/conf/ldscript.amd64 > +++ b/sys/conf/ldscript.amd64 > @@ -177,6 +177,15 @@ SECTIONS > *(.ldata .ldata.* .gnu.linkonce.l.*) > . =3D ALIGN(. !=3D 0 ? 64 / 8 : 1); > } > + .ksvar ALIGN(CONSTANT (COMMONPAGESIZE)) : > + { > + __ksvar_set_start =3D .; > + *(.ksvar_table) > + *(.ksvar) > + > + . =3D ALIGN(CONSTANT (COMMONPAGESIZE)); > + __ksvar_set_stop =3D .; > + } > . =3D ALIGN(64 / 8); > _end =3D .; PROVIDE (end =3D .); > . =3D DATA_SEGMENT_END (.); >=20 > When we want to define a variable in the kernel to share with user space > we have to use KSVAR_DEFINE macro in sys/sys/ksvar.h >=20 > +struct ksvar_set { > + uint32_t idx; > + char *pksvar; > +}; > + > +/* > + * Declare a variable into kernel shared linker_set. > + */ > +#define KSVAR_DEFINE(index, type, name) \ > + static type name __section(".ksvar"); \ > + static struct ksvar_set name ## _ksvar_set =3D { \ > + .idx =3D index, \ > + .pksvar =3D (char *) &name \ > + }; \ > + DATA_SET(ksvar_set, name ## _ksvar_set) >=20 > Every variable must have a unique index. The indexes must > start from zero and be consecutive. When you add an index > you must bump the size of the table (KSVAR_TABLE_SIZE) > (see sys/sys/ksvar.h) >=20 > The variables are inside the kernel static image that isn't managed > by the VM and so we need to allocate pages to map the physical addresses. > A new SYSINIT (ksvarinit) will allocate a set of vm_page_t through > the vm_phys_fictitious_reg_range interface and fill the table using > the information > of the ksvar_set linker set, then will create a vm_object_t (vm_object_ks= var), > mark the fake pages as valid and put them into it. > When a new process is created by exec(3) the vm_object_ksvar will be > mapped read-only into the process address space by vm_map_fixed routine > just before mapping the user stack. The address of mapping will be record= ed > inside the new p_ksvar field of the struct proc. > This field will be exported through a sysctl to the user space processes. > In order to implement syscalls as user space routines, we have to find ou= t the > mapped address of the kernel shared variables when the libc is mapped into > the process. So I added a function marked with the attribute constructor. > It will called before any code into user process and before any code insi= de > the libc. >=20 > +__attribute((constructor)) void init_kernel_shared() > +{ > + int mib[2]; > + size_t len; > + vm_offset_t ksvar_address; > + > + mib[0] =3D CTL_KERN; > + mib[1] =3D KERN_KSVAR; > + len =3D sizeof(vm_offset_t); > + if (__sysctl(mib, 2, (void *) &ksvar_address, &len, NULL, 0) !=3D= -1) > + ksvar_table =3D (uint32_t *) ksvar_address; > +} >=20 > Once the libc knows the address of the table it can access to the shared > variables. >=20 > Just as proof of concept I re-implemented gettimeofday(3) in user space. > First of all I didn't remove the entry into the syscall.master, just rena= med the > sys_gettimeofday. I need it for the fallback path. > In the kernel I introduced a struct wall_clock. >=20 > +struct wall_clock > +{ > + struct timeval tv; > + struct timezone tz; > +}; >=20 > The struct is exported through sys/sys/time.h header. > I defined a new kernel shared variable. To do so I added an index in > sys/sys/ksvar.h > WALL_CLOCK_INDEX and bumped KSVAR_TABLE_SIZE to 1. > In the sys/kern/kern_clocksource.c >=20 > +/* kernel shared variable for implmenting gettimeofday. */ > +KSVAR_DEFINE(WALL_CLOCK_INDEX, struct wall_clock, wall_clock); >=20 > Now we defined a shared variable at index WALL_CLOCK_INDEX of type > struct wall_clock and named wall_clock. > Inside handleevents I update the info exported by wall_clock. >=20 > + struct timeval tv; > + > + /* update time for userspace gettimeofday */ > + microtime(&tv); > + wall_clock.tv =3D tv; > + wall_clock.tz.tz_minuteswest =3D tz_minuteswest; > + wall_clock.tz.tz_dsttime =3D tz_dsttime; >=20 > Now, in libc we import the shared variable >=20 > +KSVAR_IMPORT(WALL_CLOCK_INDEX, struct wall_clock, wall_clock); >=20 > note that WALL_CLOCK_INDEX must be the same of the one defined > inside the kernel, and define a new function gettimeofday >=20 > +int > +gettimeofday(struct timeval *tp, struct timezone *tzp) > +{ > + > + /* fallback to syscall if kernel doesn't export ksvar */ > + if (!KSVAR_IS_ACTIVE()) > + return (sys_gettimeofday(tp, tzp)); > + > + if (tp !=3D NULL) > + *tp =3D KSVAR(wall_clock)->tv; > + if (tzp !=3D NULL) > + *tzp =3D KSVAR(wall_clock)->tz; > + return (0); > +} >=20 > Now when a process will call getimeofday, will call that function actuall= y. > If the process makes a lot of call to gettimeofday, we will see a > performance boost. > Note that if ksvar are not exported from the kernel (KSVAR_IS_ACTIVE), > the function > fallback to call the actual syscall (sys_gettimeofday). >=20 > Open tasks > - implement support for 32-bit emulated processes running in a 64-bit > environment. > - extend support to others arch > - implement more syscalls > - benchmarks > - Test, test, test. >=20 > I'm looking forward to hear about your comments and suggestions. I very much dislike what you described, it makes ABI maintanence a nightmare. Below is some mail I wrote around Spring 2009, making some notes about desired proposal. This is what called vdso in Linux land. On Tue, Mar 31, 2009 at 04:04:46PM +0200, Giuseppe Cocomazzi wrote: > Gentle kib, > I've understood what you mean to do: you said me to implement the=20 > syscall trampoline as a dynamic shared object to be copied by the kernel= =20 > in every process shared page; then we would eventually pass the shared=20 > page address to the rtld using a AT_SYSINFO_EHDR. During program=20 > startup, if this is found by the dyn linker, we define a symbol=20 > containing that previously obtained address, which libc could easily > access for its own syscall wrappers. Ok? Is this your idea? Or didn't I= =20 > get it at all? > Now, if I got what you meant, let me explain my already done work on the= =20 > syscall trampoline. > My approach does not make use of any dso: the kernel just copies a=20 > little piece of code in the syscall trampoline shared page: > popl %ecx > int $0x80 > pushl %ecx > without any symbol. This would be changed in its sysenter counterpart by= =20 > cpu_startup, in case the SEP bit is set. A sysctl has been created in=20 > order to let user programs obtain that sc trampoline address. > Crt has been patched to retrieve the address by means of the sysctl at=20 > run time and then puts this address in a global symbol named 'sctramp'. > (I want you to know that the sysctl mechanism could be simply avoided if= =20 > we decide to have syscall shared pages at a fixed address: actually=20 > they are mapped to maxsaddr. This need to be discussed later, but is not= =20 > the hot point, now.) > The 'sctramp' symbol is accessed by libc wrappers to enter the kernel=20 > when issuing system calls: > #define KERNCALL call *%sctramp > What I want to say is the following: I think your approach is the same=20 > as mine, in that the rtld has to load a shared object in any case, being= =20 > it crt or a custom dso. But since crt is already there, why do we need=20 > to create another .so when we already have one which is however linked=20 > in the process address space? Think of crt as your custom dso, and=20 > you'll get the picture. > Maybe, your approach is more elegant than mine, though mine is more=20 > minimal and less invasive (no symbols to take into account but one,=20 > etc). Furthermore, don't understimate the fact that I've already coded=20 > and tested it: I attach a copy of the whole patch, so that you can have= =20 > a look at it, accompanied by a little explaining paper. Don't waste your= =20 > time in reviewing the kernel part (this is rookie's task), concentrate=20 > on the user space part. > Hope I don't cause a waste of your precious time, > Regards Crt is not dso. It is the stub that got linked statically to most binaries. The actual mechanism you implemented is _ortohonal_ to decision of having shared page as a dso. That dso shall not be used to provide any "addresses" to libc. Libc syscall stabs shall call some functional symbol, that is defined strong in the dso, and weak in the rtld. Rtld implementation shall be int0x80. Rtld shall preload the dso, assuming the aux entry supplied by the kernel contains phdr address of the object. Features that gives us the dso: 1. Absolute freedom in the layout of the page. 2. Page may implement several entries, among them are - syscall (that is what you described above); - gettimeofday with optimized implementation (see long threads about TSC, APCI HPET etc); - getpid - machine-optimized copy routines like memcpy, strcpy and so on. - signal trampoline code (see #5 below, why this is _very_ desirable). - ... (there I have stopped my imagination) 3. Addition of new symbols does not require any changes to libc to activate them, because the standard behaviour of the dynamic linker gives the priority to the symbols from the preloaded objects over the symbols from the dependencies. 4. Dso gives the right place for the CFI to be found by debuggers and exception propagation code (CFI stands for Call Frame Information, it is used to allow the stack unwinding to properly restore frames and registers). amd64 already suffers from the lack of CFI on signal trampolines and sysenter wrappers. Bare shared page is ugly from this point of view. Need for CFI was one of the main motivation for the dso on Linux. 5. Putting signal trampoline into the shared page instead of top of the stack would be a great step into enabling NX bit for the stack. 6. Linuxolator would get the vdso too, that is big deficiency in it now. As you see, list of the items that are desirable on the shared page is quite long, and having fixed format is the large problem for binary compatibility. --SRK8lRENmpuaYFQC Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk/JGXkACgkQC3+MBN1Mb4iqcQCeKC+6UcscqSD0AkKnVu1QPiTu VrUAoI0hxz1U92+l9Ka0acuRJXg42AV5 =8QvG -----END PGP SIGNATURE----- --SRK8lRENmpuaYFQC-- From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 21:21:55 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CB19D106564A; Fri, 1 Jun 2012 21:21:55 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail06.syd.optusnet.com.au (mail06.syd.optusnet.com.au [211.29.132.187]) by mx1.freebsd.org (Postfix) with ESMTP id 2BAE98FC0C; Fri, 1 Jun 2012 21:21:55 +0000 (UTC) Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail06.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q51LLjQW002570 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 2 Jun 2012 07:21:46 +1000 Date: Sat, 2 Jun 2012 07:21:45 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Giovanni Trematerra In-Reply-To: Message-ID: <20120602044306.S4049@besplex.bde.org> References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Attilio Rao , alc@freebsd.org, Konstantin Belousov , Alexander Kabaev , freebsd-arch@freebsd.org Subject: Re: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2012 21:21:55 -0000 On Fri, 1 Jun 2012, Giovanni Trematerra wrote: > I'd like to discuss a way to provide a mechanism to share some read-only > data between kernel and user space programs avoiding syscall overhead, > implementing some them, such as gettimeofday(3) and time(3) as ordinary > user space routine. This is particularly unsuitable for implementing gettimeofday(), since for it to work you would need to use approximately 1 CPU spinning in the kernel to update the time every microsecond. For time(3), it only needs a relatively slow update. For clock_gettime() with nansoeconds precision, it is even more unsuitable. For clock_gettime() with precisions between 1 second and 1 microseconds, it is intermediately unsuitable. It also requires some complications for locking/atomicity and coherency (much the same as in the kernel. Not just for times. For times, the kernel handles locking/atomicity fairly well, and coherency fairly badly. > The patch at > http://www.trematerra.net/patches/ksvar_experimental.patch > > is in a very experimental stage. It's just a proof-of-concept. > Only works for an AMD64 kernel and only for 64-bit applications. > The idea is to have all the variables that we want to share between kernel > and user space into one or more consecutive pages of memory that will be > mapped read-only into every running process. At the start of the first > shared page > there'll be a table with as many entries as the number of the shared variables. > Each entry is a 32-bit value that is the offset between the start of the shared > page and the start of the variable in the page. The user space processes need > to find out the map address of shared page and use the table to access to the > shared variables. On amd64, 2 32-bit values or 64-bit values with most bits 0 or 1 can be packed/encoded into 1 64-bit value to give a certain atomicity without locking. The corresponding i386 packing into 1 32-bit value doesn't work so well. > ... > Just as proof of concept I re-implemented gettimeofday(3) in user space. > First of all I didn't remove the entry into the syscall.master, just renamed the > sys_gettimeofday. I need it for the fallback path. > In the kernel I introduced a struct wall_clock. > > +struct wall_clock > +{ > + struct timeval tv; > + struct timezone tz; > +}; This is much larger than 64 bits. struct timezone is relatively unimportant. struct timeval is bloated on amd64 (128 bits), but can be packed into 64 bits (works for a few hundred years). On i386, it could be packed into 20 bits for tv_usec and 12 bits for an offset for tv_sec. > Now we defined a shared variable at index WALL_CLOCK_INDEX of type > struct wall_clock and named wall_clock. > Inside handleevents I update the info exported by wall_clock. > > + struct timeval tv; > + > + /* update time for userspace gettimeofday */ > + microtime(&tv); This is supposed to have no races (microtime() uses a generation counter to recover from them), and gives the necessary microseconds precision, provided it is called every microsecond. This doesn't quite require a CPU spinning to update. You could also unmap the variable 1 microsecond after it is accessed and take a pagefault to update it for the next access. > + wall_clock.tv = tv; This has races (when userland accesses it in the middle of the kernel update). > + wall_clock.tz.tz_minuteswest = tz_minuteswest; > + wall_clock.tz.tz_dsttime = tz_dsttime; This has races, but rarely changes. > +int > +gettimeofday(struct timeval *tp, struct timezone *tzp) > +{ > + > + /* fallback to syscall if kernel doesn't export ksvar */ > + if (!KSVAR_IS_ACTIVE()) > + return (sys_gettimeofday(tp, tzp)); > + > + if (tp != NULL) > + *tp = KSVAR(wall_clock)->tv; Races. These can be fixed using a generation counter as in the kernel, but it is much harder since we are not in control of the update times and the updates must occur much more frequently. In the kernel, the critical updates occur only every 1-10 msec (except when timers are stopped). This gives an offset, and a difference is added to this, with atomicity for the difference given even naturally by the hardware or by locking the hardware. Then generation count changes only every 1-10 msec, and itself is updated only by implicit serialization instructions. No ordering is enforced for stores and loads, but the implicit serializations have 1-10 msec to flush the stores and loads, in any order provided that they happen during this time, for the generation count to work. > + if (tzp != NULL) > + *tzp = KSVAR(wall_clock)->tz; Minor races. > + return (0); > +} I don't see how to implement gettimeofday() in userland without duplicating most of the kernel's timecounter code. The kernel can keep an offset, updated every 1-10 msec. Userland calls the hardware timecounter and scales it delicately as in the kernel, but with even more complications to synchronize with ntpd micro-adjustments and other threads and timecounter hardware. > Now when a process will call getimeofday, will call that function actually. > If the process makes a lot of call to gettimeofday, we will see a > performance boost. Without the above problems being solved, we would see it not working precisely when it is used a lot. It would report time differences of 0 (or occasionally garbage from races) for events separated by hundreds of thousands of nanoseconds. gettimeofday()'s precision and accuracy are well established, so there must be many programs that depend on them. The newer clock_gettime() interfaces give more choice and must be used if sub-microsecond precision is wanted and sub-microsecond accuracy is hoped for (with normal X86 hardware, an accuracy of ~10 nanosecond is nearly possible for small differences in relative times but not for absolute times). But again, the low precision clock ids aren't very useful, since if you actually use them enough to notice their slowness, then they won't distinguish different events. These ids are especially bogus when clock_gettime() is implemented as a syscall, since the syscall overhead is about 90% of the total overhead provided the timecounter hardware is efficient (clock_getttime(CLOCK_MONOTONIC, ...) might take 300 nsec, while clock_gettime(CLOCK_MONOTONIC_FAST_N_BROKEN, ...) takes 330 nsec. Then you may as well use precise version). But if clock_gettime(CLOCK_MONOTONIC, ...) is a syscall as it needs to be for good precision, while clock_gettime(CLOCK_MONOTONIC_FAST_N_BROKEN, ...) is memory mapped, then the latter becomes almost useful. BTW, the FAST_N_BROKEN clock ids advertise bogus precision in clock_getres(), so you can't tell (except from their names) how much worse their precision is than that of the unbroken versions. (clock_getres() is bogusly named (POSIX standard), since it returns the precision (the resolution is 1 nsec for all timespec interfaces). POSIX clearly doesn't intend for clock_getres() to actually return the resolution, since that is known at compile time and parts of the specification of clock_getres() require precision semantics. Many parts are confused in other ways, at least in 2001 draft7, and are not implemented in FreeBSD. E.g., clock_settime() is specified to truncate the time to a multiple of the "resolution" returned by clock_getres() with the same id. This more or less assumes that the clock is in hardware ticks, with no micro-adjustments. FreeBSD does micro-adjustments and delicate scaling of hardware clocks, with a low-level resolution of 2**-64 seconds, and doesn't truncate the time in clock_settime(). File times are more interesting, since they often have to be truncated to fit on disks; POSIX only started specifying the details of this about 4 years ago. In this mail, I try to use "precision" consistently instead of "resolution", except where describing POSIX bugs.) For old clock ids, FreeBSD returns something reasonable (the hardware or virtual clock update period, rounded up), except for CLOCK_PROF and CLOCK_VIRTUAL, it returns the period of the unrelated hz clock (the actual clock is an impossible-to-express combination of the stathz clock for sampling the user+sys decomposition, the cputick clock for the total time, and these limited by the resolution of the getrusage()/calcru() timeval-based API, 1 usec for the tiemval resolution is probably best on fast machine, but I use 1/stathz seconds.) For newer clock ids, FreeBSD the precision of the precise clock for all the non-precise clocks (the latter have a virtual clock update period of tc_tick/hz seconds, so they should advertise that as their precision). CLOCK_THREAD_CPUTIME_ID uses the cputicker as its basic clock, but rounds to usec, so its precision is an integral multiple of 1000 nsec. The scaling for this is quite different and broken than that for the old clock ids. It rounds down instead of up, and then rounds up to 1000 nsec iff the previous value was 0. The first scaling step has the nonsense factor of 1000000 instead of 1000000000, and the result is accidentally correct in the usual case where the cputicker frequency is < 1000000, else garbage (not a multiple of 1000, and usually too small): % int % kern_clock_getres(struct thread *td, clockid_t clock_id, struct timespec *ts) % { % % ts->tv_sec = 0; % switch (clock_id) { % case CLOCK_REALTIME: % case CLOCK_REALTIME_FAST: % case CLOCK_REALTIME_PRECISE: % case CLOCK_MONOTONIC: % case CLOCK_MONOTONIC_FAST: % case CLOCK_MONOTONIC_PRECISE: % case CLOCK_UPTIME: % case CLOCK_UPTIME_FAST: % case CLOCK_UPTIME_PRECISE: Most of these shouldn't be here (or anywhere). % /* % * Round up the result of the division cheaply by adding 1. % * Rounding up is especially important if rounding down % * would give 0. Perfect rounding is unimportant. % */ % ts->tv_nsec = 1000000000 / tc_getfrequency() + 1; Correct value for CLOCK_REALTIME and CLOCK_MONOTONIC. Also for unportable aliases of these, but those shouldn't exist. % break; % case CLOCK_VIRTUAL: % case CLOCK_PROF: % /* Accurately round up here because we can do so cheaply. */ % ts->tv_nsec = (1000000000 + hz - 1) / hz; Should use stathz ore maybe just the resolution of a timeval (except when the precision is limited by the cputicker more than by timevals, use the former limit). % break; % case CLOCK_SECOND: % ts->tv_sec = 1; % ts->tv_nsec = 0; % break; Correct. Unlike most cases, clock_gettime() with this returns an integral multiple of the precision. For hardware clock ids with precise hardware, we don't really know the precision and don't want to round to it (it will be a small number of nsec, perhaps 0, plus a fraction of a nanosec, and we don't want to round everything based on these fractions). But for software clock ids, the resolution will be about 1-10msec and rounding to a multiple of 1 or 10 msec would be good if that is the update frequency (it will often be 1/hz with hz a nice multiple of 10, except for inaccuracies in the interrupt frequency). % case CLOCK_THREAD_CPUTIME_ID: % /* sync with cputick2usec */ % ts->tv_nsec = 1000000 / cpu_tickrate(); % if (ts->tv_nsec == 0) % ts->tv_nsec = 1000; % break; Nonsense scaling. When cpu_tickrate() <= 1000000, then result is too small by a factor of 1000 and often invalid since it is not a multiple of 1000. But on most or all supported arches, cpu_tickrate() is > 1000000 so the result of the division is 0 and this is fixed up to the correct value of 1000. The correct scaling is more like the above: - use the same scale factor as above. Here we are scaling to nsec, so it is almost irrelvant that cputick2usec() scales to usec. We should just use 1000 from cputick2usec()'s limit on the precision in the (usual) case that that limit is stricter (higher) than the one got by correct scaling here - when scaling to nsec, round up as above, not down as this does now - when scaling to nsec, round up efficiently as above, not with branching logic as this does now. But we probably need the branching logic to clamp up to 1000. Note that clock_gettime(CLOCK_THREAD_CPUTIME_ID, ...) scales to usec just to use old KPIs which only support timevals, although clock_gettime() uses timespecs. % default: % return (EINVAL); % } % return (0); % } Bruce From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 21:42:41 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 4ADA7106566B; Fri, 1 Jun 2012 21:42:41 +0000 (UTC) (envelope-from giovanni.trematerra@gmail.com) Received: from mail-qa0-f47.google.com (mail-qa0-f47.google.com [209.85.216.47]) by mx1.freebsd.org (Postfix) with ESMTP id B57408FC15; Fri, 1 Jun 2012 21:42:40 +0000 (UTC) Received: by qabg1 with SMTP id g1so677010qab.13 for ; Fri, 01 Jun 2012 14:42:34 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=1b+jNglZtbxjmC/T2jxHqzQUByHT4QP3SQaqaclj0LA=; b=UxZZFTPtRZLOfjplU5Vter32S+ZMyTi1K5bFSf7LvXxeIjwz6RHrv16Oduj/vg8AaE dyVvGfAtQ1DMoV49dyldbiqVBaVDFWv2QWzCR26Fu2T6debRcfTy7dvjVoVn2vVl4xZs 4ae6n2ff3Yo1y2DBNqn7V6iztPTVEUNZ35w9MvYYrXGmT2MRqYl8Ixed3E2u49/c259R fuVHJn/q0ySp5Rnuv23nQvaODt0eraUn1WgpVhR94WfT3j/YwWaZSaeHLnAypulpJePP ZF3b6gjlmx7enPjSDAtLzxRj6yr2c13G5yacQz8SVHnXf7Nb1UGW6FYK3EBoF/+fD0vb DZaQ== MIME-Version: 1.0 Received: by 10.224.181.134 with SMTP id by6mr5853096qab.56.1338586954256; Fri, 01 Jun 2012 14:42:34 -0700 (PDT) Received: by 10.229.160.20 with HTTP; Fri, 1 Jun 2012 14:42:34 -0700 (PDT) In-Reply-To: <20120601193522.GA2358@deviant.kiev.zoral.com.ua> References: <20120601193522.GA2358@deviant.kiev.zoral.com.ua> Date: Fri, 1 Jun 2012 23:42:34 +0200 Message-ID: From: Giovanni Trematerra To: Konstantin Belousov Content-Type: text/plain; charset=ISO-8859-1 Cc: Attilio Rao , alc@freebsd.org, Alexander Kabaev , freebsd-arch@freebsd.org Subject: Re: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2012 21:42:41 -0000 On Fri, Jun 1, 2012 at 9:35 PM, Konstantin Belousov wrote: > On Fri, Jun 01, 2012 at 07:53:15PM +0200, Giovanni Trematerra wrote: >> Hello, >> I'd like to discuss a way to provide a mechanism to share some read-only >> data between kernel and user space programs avoiding syscall overhead, >> implementing some them, such as gettimeofday(3) and time(3) as ordinary >> user space routine. >> >> The patch at >> http://www.trematerra.net/patches/ksvar_experimental.patch >> > > I very much dislike what you described, it makes ABI maintanence > a nightmare. While I respect your decision to dislike my work, could you please give me some concrete examples of ABI maintenance nightmare? I mean not based on speculation. > Below is some mail I wrote around Spring 2009, making some notes about > desired proposal. I wonder why your proposal isn't on the Ideas Page wiki. By the way, is this proposal valid? http://wiki.freebsd.org/IdeasPage#Avoiding_syscall_overhead_.28GSoC.29 > This is what called vdso in Linux land. I know what vdso is in Linux land but while implementing vdso will give us some additional features in any case it needs a mechanism like the one I implemented (ksvar) to access kernel data while in user space and I think my implementation isn't too much different from what is called VVAR in linux parlance. Please take a look at http://fxr.watson.org/fxr/source/arch/x86/include/asm/vvar.h?v=linux-2.6 http://fxr.watson.org/fxr/source/arch/x86/vdso/vclock_gettime.c?v=linux-2.6;im=10 Could you please review the kernel part of the patch? -- Gianni From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 22:23:43 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BDEB01065670; Fri, 1 Jun 2012 22:23:43 +0000 (UTC) (envelope-from giovanni.trematerra@gmail.com) Received: from mail-qa0-f47.google.com (mail-qa0-f47.google.com [209.85.216.47]) by mx1.freebsd.org (Postfix) with ESMTP id 1D5158FC12; Fri, 1 Jun 2012 22:23:42 +0000 (UTC) Received: by qabg1 with SMTP id g1so692044qab.13 for ; Fri, 01 Jun 2012 15:23:42 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; bh=XEteMbH8tH0s92292qHmaUoLnGd0r/RCe99bVApaSaU=; b=Lu4IaRzeZdb76tjcTW3tulB2eQtIWCarzwabDYjO62yuvC+0Tw9MPrspwxwTtPKTwk ABcM/lO1QRx/BoaVMXx+5okHFgRT6VLVoEVPjut5Q1fTawFbiP4e2xbwPfShwcEp0oJw KJXm4PduB/9zvB8IqllSf7qItJoADLZM/f75+EzY3z61U/mTnwoB5uL6migsH1k1AU3Y Ev8k5kiboo633CZX3KlfPGT2K8Dxmn3CTD6HStoPaQPxtEbhdiH3oYSaJ2vOstOv74iD dofNu2JhW4FsAwCOOsqg+tAPPO3jOnIqc6+eXyoe/ErlvOvo93ZV3tv32leBlXzJFltH CNeA== MIME-Version: 1.0 Received: by 10.224.187.147 with SMTP id cw19mr6012337qab.47.1338589422333; Fri, 01 Jun 2012 15:23:42 -0700 (PDT) Received: by 10.229.160.20 with HTTP; Fri, 1 Jun 2012 15:23:42 -0700 (PDT) In-Reply-To: <20120602044306.S4049@besplex.bde.org> References: <20120602044306.S4049@besplex.bde.org> Date: Sat, 2 Jun 2012 00:23:42 +0200 Message-ID: From: Giovanni Trematerra To: Bruce Evans Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: Attilio Rao , alc@freebsd.org, Konstantin Belousov , Alexander Kabaev , freebsd-arch@freebsd.org Subject: Re: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2012 22:23:43 -0000 On Fri, Jun 1, 2012 at 11:21 PM, Bruce Evans wrote: > On Fri, 1 Jun 2012, Giovanni Trematerra wrote: > >> I'd like to discuss a way to provide a mechanism to share some read-only >> data between kernel and user space programs avoiding syscall overhead, >> implementing some them, such as gettimeofday(3) and time(3) as ordinary >> user space routine. > > > This is particularly unsuitable for implementing gettimeofday(), since fo= r > it to work you would need to use approximately 1 CPU spinning in the > kernel to update the time every microsecond. =A0For time(3), it only need= s > a relatively slow update. =A0For clock_gettime() with nansoeconds precisi= on, > it is even more unsuitable. =A0For clock_gettime() with precisions betwee= n > 1 second and 1 microseconds, it is intermediately unsuitable. > > It also requires some complications for locking/atomicity and coherency > (much the same as in the kernel. =A0Not just for times. =A0For times, the > kernel handles locking/atomicity fairly well, and coherency fairly badly. > Well, the primary intend of the patch is to provide a mechanism to share da= ta between kernel and user land without switching in kernel mode. Not to provi= de a complete re-implementation in user mode of all time stuff. > >> The patch at >> http://www.trematerra.net/patches/ksvar_experimental.patch >> >> is in a very experimental stage. It's just a proof-of-concept. >> Only works for an AMD64 kernel and only for 64-bit applications. >> The idea is to have all the variables that we want to share between kern= el >> and user space into one or more consecutive pages of memory that will be >> mapped read-only into every running process. At the start of the first >> shared page >> there'll be a table with as many entries as the number of the shared >> variables. >> Each entry is a 32-bit value that is the offset between the start of the >> shared >> page and the start of the variable in the page. The user space processes >> need >> to find out the map address of shared page and use the table to access t= o >> the >> shared variables. > > > On amd64, 2 32-bit values or 64-bit values with most bits 0 or 1 can be > packed/encoded into 1 64-bit value to give a certain atomicity without > locking. =A0The corresponding i386 packing into 1 32-bit value doesn't wo= rk > so well. These values are written just one time during a SYSINIT routine and are onl= y read by user processes. > >> ... > > >> Just as proof of concept I re-implemented gettimeofday(3) in user space. >> First of all I didn't remove the entry into the syscall.master, just >> renamed the >> sys_gettimeofday. I need it for the fallback path. >> In the kernel I introduced a struct wall_clock. >> >> +struct wall_clock >> +{ >> + =A0 =A0 =A0 struct timeval =A0tv; >> + =A0 =A0 =A0 struct timezone tz; >> +}; > > > This is much larger than 64 bits. =A0struct timezone is relatively > unimportant. > struct timeval is bloated on amd64 (128 bits), but can be packed into 64 > bits (works for a few hundred years). =A0On i386, it could be packed into > 20 bits for tv_usec and 12 bits for an offset for tv_sec. > Thanks a lot for your explanation. I think they will be precious as a refer= ence. Nonetheless I just wrote gettimeofday in that way just as proof-of-concept, just to show how things could be supposed to work, it didn't mean to be cor= rect. I think it was just unfortunate to have choose gettimeofday. I'm most interested in the VM things of the patch. -- Gianni From owner-freebsd-arch@FreeBSD.ORG Sat Jun 2 00:11:29 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id BE66B106564A; Sat, 2 Jun 2012 00:11:29 +0000 (UTC) (envelope-from julian@freebsd.org) Received: from vps1.elischer.org (vps1.elischer.org [204.109.63.16]) by mx1.freebsd.org (Postfix) with ESMTP id 4D5FF8FC24; Sat, 2 Jun 2012 00:11:26 +0000 (UTC) Received: from JRE-MBP-2.local (c-67-180-24-15.hsd1.ca.comcast.net [67.180.24.15]) (authenticated bits=0) by vps1.elischer.org (8.14.5/8.14.5) with ESMTP id q520B1QL038993 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Fri, 1 Jun 2012 17:11:03 -0700 (PDT) (envelope-from julian@freebsd.org) Message-ID: <4FC95A10.7000806@freebsd.org> Date: Fri, 01 Jun 2012 17:10:56 -0700 From: Julian Elischer User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:12.0) Gecko/20120428 Thunderbird/12.0.1 MIME-Version: 1.0 To: Bryan Drewery References: <4FAC3EAB.6050303@delphij.net> <861umkurt8.fsf@ds4.des.no> <20120517055425.GA802@infradead.org> <4FC762DD.90101@FreeBSD.org> <4FC81D9C.2080801@FreeBSD.org> <4FC8E29F.2010806@shatow.net> In-Reply-To: <4FC8E29F.2010806@shatow.net> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: =?ISO-8859-1?Q?Dag-Erling_Sm=F8?=@freebsd.org, Adrian Chadd , Doug Barton , d@delphij.net, Andriy Gapon , Eitan Adler , freebsd-arch@freebsd.org, rgrav Subject: Re: Allow small amount of memory be mlock()'ed by unprivileged process? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Jun 2012 00:11:29 -0000 On 6/1/12 8:41 AM, Bryan Drewery wrote: > On 5/31/2012 8:40 PM, Doug Barton wrote: >> On 5/31/2012 5:23 AM, Andriy Gapon wrote: >>> In fact, FreeBSD also has this rlimit and there seems to be full support for it on >>> both user and kernel sides. >>> OTOH, PRIV_VM_MLOCK privilege seems to be granted only to the super-user in the >>> default configuration. And this privilege kind of defeats the limit. >>> >>> Perhaps, we should/could kill the privilege and set the limit to a sufficiently >>> small/safe value for ordinary users? >> I like this idea, but someone else in the thread (sorry, don't have it >> handy) brought up the point that we don't want the aggregate of per-user >> limits to be able to bring down the system either. So the right solution >> would seem to be a reasonable per-user limit, and a cap on the maximum >> total amount of locked pages for all unprivileged users, probably based >> on some percentage of total available memory? >> >> Doug >> > I like this approach. A per-user ulimit, and a global max sysctl that > can be overridden, but by default based on a percentage of available memory. I'd go a different route. I'd have it inherited, and I'd have the value be 0 by default, but settable to some different value at login.conf, or by an ancestor with root privs. From owner-freebsd-arch@FreeBSD.ORG Sat Jun 2 11:03:51 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 27D56106576A; Sat, 2 Jun 2012 11:03:51 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 0A3038FC16; Sat, 2 Jun 2012 11:03:46 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id OAA23522; Sat, 02 Jun 2012 14:03:44 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1Sam7U-000ILr-Bu; Sat, 02 Jun 2012 14:03:44 +0300 Message-ID: <4FC9F30E.4030205@FreeBSD.org> Date: Sat, 02 Jun 2012 14:03:42 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:12.0) Gecko/20120503 Thunderbird/12.0.1 MIME-Version: 1.0 To: Doug Barton References: <4FAC3EAB.6050303@delphij.net> <861umkurt8.fsf@ds4.des.no> <20120517055425.GA802@infradead.org> <4FC762DD.90101@FreeBSD.org> <4FC81D9C.2080801@FreeBSD.org> In-Reply-To: <4FC81D9C.2080801@FreeBSD.org> X-Enigmail-Version: 1.5pre Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: =?ISO-8859-1?Q?Dag-Erling_Sm=F8?=@FreeBSD.org, Adrian Chadd , d@delphij.net, Eitan Adler , freebsd-arch@FreeBSD.org, rgrav Subject: Re: Allow small amount of memory be mlock()'ed by unprivileged process? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Jun 2012 11:03:51 -0000 on 01/06/2012 04:40 Doug Barton said the following: > On 5/31/2012 5:23 AM, Andriy Gapon wrote: >> In fact, FreeBSD also has this rlimit and there seems to be full support for it on >> both user and kernel sides. >> OTOH, PRIV_VM_MLOCK privilege seems to be granted only to the super-user in the >> default configuration. And this privilege kind of defeats the limit. >> >> Perhaps, we should/could kill the privilege and set the limit to a sufficiently >> small/safe value for ordinary users? > > I like this idea, but someone else in the thread (sorry, don't have it > handy) brought up the point that we don't want the aggregate of per-user > limits to be able to bring down the system either. The unprivileged users can not spawn any new users on their own and there is a limit on number of processes per user, so a system administrator should be able to plan resource limits based on system capacity and utilization. > So the right solution > would seem to be a reasonable per-user limit, and a cap on the maximum > total amount of locked pages for all unprivileged users, probably based > on some percentage of total available memory? I would agree for a default limit of zero even. As long as I (as a system administrator) am able to increase it for selected users and groups. -- Andriy Gapon From owner-freebsd-arch@FreeBSD.ORG Sat Jun 2 11:30:24 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 74216106566C; Sat, 2 Jun 2012 11:30:24 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 9F94B8FC0C; Sat, 2 Jun 2012 11:30:22 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id OAA23639; Sat, 02 Jun 2012 14:30:20 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1SamXE-000INm-Ht; Sat, 02 Jun 2012 14:30:20 +0300 Message-ID: <4FC9F94B.8060708@FreeBSD.org> Date: Sat, 02 Jun 2012 14:30:19 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:12.0) Gecko/20120503 Thunderbird/12.0.1 MIME-Version: 1.0 To: Julian Elischer References: <4FAC3EAB.6050303@delphij.net> <861umkurt8.fsf@ds4.des.no> <20120517055425.GA802@infradead.org> <4FC762DD.90101@FreeBSD.org> <4FC81D9C.2080801@FreeBSD.org> <4FC8E29F.2010806@shatow.net> <4FC95A10.7000806@freebsd.org> In-Reply-To: <4FC95A10.7000806@freebsd.org> X-Enigmail-Version: 1.5pre Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: Adrian Chadd , Doug Barton , d@delphij.net, Eitan Adler , freebsd-arch@FreeBSD.org, =?ISO-8859-1?Q?Dag-Erling_Sm=F8rgrav?= , Edward Tomasz Napierala , Bryan Drewery Subject: Re: Allow small amount of memory be mlock()'ed by unprivileged process? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Jun 2012 11:30:24 -0000 on 02/06/2012 03:10 Julian Elischer said the following: > I'd go a different route. > I'd have it inherited, and I'd have the value be 0 by default, but settable to > some different value at login.conf, or by an ancestor with root privs. I think that this is how the limits work in general :-) I agree that defaulting the limit to 0 for non-privileged users is a good idea, at least at the beginning. For super-user we might want to keep the limit uncapped. Some further technical observations: o I was overly optimistic about _full_ support for RLIMIT_MEMLOCK - mlockall() doesn't support itat the moment and I am not sure if it is easy to implement the support for the MCL_FUTURE case. o Currently the default class in default login.conf has memorylocked=unlimited - not very smart. o There is also vm.max_wired sysctl (with no equivalent tunable), which specifies number of _pages_ that can be wired system wide (by both kernel and userland). But note that the limit applies only to userland requests, the kernel is allowed to wire new pages even when the limit is exceeded. By default the limit is set to 1/3 of available pages. So watch out for this limit when using ZFS, ZFS can easily starve userland. o I've just discovered :-) that we also have RCTL/RACCT framework (not enabled by default) aka "Resource Accounting" / "Resource Limits", which seems to parallel the conventional limits in many categories including the locked memory. Not sure why we have that and if the interactions between conventional limits, resource limits and privileges would be easy to untangle. o A general observation that our way of setting resource limits via login classes (login.conf) seems to be inferior to limits.conf way of Linux. More about the last point. In addition to the traditional users and groups we also have login classes. Initial (conventional) resource limits can be set only via login.conf, i.e. via the classes. The classes can be assigned only in master.passwd and thus only to users. So if I want to increase some limit for a group, then I have to create (and maintain) a parallel class and assign that class to all users in question. Now imagine a case of a user being in several groups. Ability to specify the limits on per-user/per-group basis like it is done with Linux limits.conf seems to be more convenient. The new rctl framework also allows to set resource limits for "process, user, login class, or jail". 'group' is missing from the list. -- Andriy Gapon From owner-freebsd-arch@FreeBSD.ORG Sat Jun 2 13:01:43 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C0CD7106566B; Sat, 2 Jun 2012 13:01:43 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from mail-lpp01m010-f54.google.com (mail-lpp01m010-f54.google.com [209.85.215.54]) by mx1.freebsd.org (Postfix) with ESMTP id B0E898FC08; Sat, 2 Jun 2012 13:01:42 +0000 (UTC) Received: by laai10 with SMTP id i10so2736144laa.13 for ; Sat, 02 Jun 2012 06:01:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=qklQ6e84wildO7Fn6SCjujHT7KqOIGiFlwqTcBYNTK8=; b=XEBf4tPOr6vMdKjutq8EhV/LdcSvdG1SZhcACmtCNj0XBZFZmFdbmtcJPg83b/jfuZ 4fBv1VYvZcbXdiaUCS0EYtLQ9m2brtz9mvQrJjwIvgFffVE20E/iVDDREQ00Y2XsWBYc PukzFEZ2F0/DPGm9Ok5FVi0ht6EiPZifWBVAgxB6JWKSo03xQ75+SdU4/dr71Fe8Joxn 9EL5MWnrjXIAJGyYO3pHw2FaFWzVG6S+hwDg9MsHEqk8iUzNTXx3aEbjqmjVo1x+u7CQ kbtHAVEMTnJCXfQMF4VeuOjCj4BonRIGJt4iUTmjOe+NtyyQ2bt0J8iWtIttX+1T81Ya nsGA== MIME-Version: 1.0 Received: by 10.152.135.105 with SMTP id pr9mr6535166lab.37.1338642095871; Sat, 02 Jun 2012 06:01:35 -0700 (PDT) Sender: asmrookie@gmail.com Received: by 10.112.27.65 with HTTP; Sat, 2 Jun 2012 06:01:35 -0700 (PDT) In-Reply-To: <20120601193522.GA2358@deviant.kiev.zoral.com.ua> References: <20120601193522.GA2358@deviant.kiev.zoral.com.ua> Date: Sat, 2 Jun 2012 14:01:35 +0100 X-Google-Sender-Auth: RlMy4owDf64wdOP4xglAbXaJ6ao Message-ID: From: Attilio Rao To: Konstantin Belousov Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Cc: alc@freebsd.org, Alexander Kabaev , Giovanni Trematerra , freebsd-arch@freebsd.org Subject: Re: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Jun 2012 13:01:43 -0000 2012/6/1 Konstantin Belousov : > On Fri, Jun 01, 2012 at 07:53:15PM +0200, Giovanni Trematerra wrote: >> Hello, >> I'd like to discuss a way to provide a mechanism to share some read-only >> data between kernel and user space programs avoiding syscall overhead, >> implementing some them, such as gettimeofday(3) and time(3) as ordinary >> user space routine. >> >> The patch at >> http://www.trematerra.net/patches/ksvar_experimental.patch >> >> is in a very experimental stage. It's just a proof-of-concept. >> Only works for an AMD64 kernel and only for 64-bit applications. >> The idea is to have all the variables that we want to share between kern= el >> and user space into one or more consecutive pages of memory that will be >> mapped read-only into every running process. At the start of the first >> shared page >> there'll be a table with as many entries as the number of the shared var= iables. >> Each entry is a 32-bit value that is the offset between the start of the= shared >> page and the start of the variable in the page. The user space processes= need >> to find out the map address of shared page and use the table to access t= o the >> shared variables. >> Kernel will export a variable to user space as an index, so user space c= ode >> must refer to a specific index to access a kernel shared variable. >> Let's take a quick look to the KPI/API for exporting/importing kernel >> shared variables. >> Say we want implement a routine to export an int from the kernel. >> To define the variable to be exported inside the kernel you would use >> >> KSVAR_DEFINE(0, int, test_value); >> >> You have just defined an int variable named "test_value" at index 0. >> Inside the kernel you can write/read as usual using the symbol test_valu= e; >> Now you likely want add to libc a function callable from user processes >> that return the test_value variable. So first of all you need the import= the >> variable. >> >> KSVAR_IMPORT(0, int, test_value); >> >> and to obtain a pointer to read the value you would use >> >> KSVAR(test_value); >> >> so your function would look like something like this >> >> int get_test_value() >> { >> >> =C2=A0 =C2=A0 =C2=A0return (*KSVAR(test_value)); >> } >> >> Then inside your process just call get_test_value() function as you usua= lly >> do and you'll get a kernel written value without switching in kernel mod= e. >> >> Let's see now in more detail how that could be accomplished. >> The shared variables will be accessed as normal variables and are read/w= rite >> inside the kernel. The variables need to be inside the same page(s) and = nothing >> but the shared variables (and the table) must be into the page(s). To >> obtain that >> I changed the linker script in this way >> >> --- a/sys/conf/ldscript.amd64 >> +++ b/sys/conf/ldscript.amd64 >> @@ -177,6 +177,15 @@ SECTIONS >> =C2=A0 =C2=A0 *(.ldata .ldata.* .gnu.linkonce.l.*) >> =C2=A0 =C2=A0 . =3D ALIGN(. !=3D 0 ? 64 / 8 : 1); >> =C2=A0 } >> + =C2=A0.ksvar ALIGN(CONSTANT (COMMONPAGESIZE)) : >> + =C2=A0{ >> + =C2=A0 =C2=A0__ksvar_set_start =3D .; >> + =C2=A0 =C2=A0*(.ksvar_table) >> + =C2=A0 =C2=A0*(.ksvar) >> + >> + =C2=A0 . =3D ALIGN(CONSTANT (COMMONPAGESIZE)); >> + =C2=A0 __ksvar_set_stop =3D .; >> + =C2=A0} >> =C2=A0 . =3D ALIGN(64 / 8); >> =C2=A0 _end =3D .; PROVIDE (end =3D .); >> =C2=A0 . =3D DATA_SEGMENT_END (.); >> >> When we want to define a variable in the kernel to share with user space >> we have to use KSVAR_DEFINE macro in sys/sys/ksvar.h >> >> +struct ksvar_set { >> + =C2=A0 =C2=A0 =C2=A0 uint32_t idx; >> + =C2=A0 =C2=A0 =C2=A0 char *pksvar; >> +}; >> + >> +/* >> + * Declare a variable into kernel shared linker_set. >> + */ >> +#define =C2=A0 =C2=A0 =C2=A0 =C2=A0KSVAR_DEFINE(index, type, name) \ >> + =C2=A0 =C2=A0 =C2=A0 static type name __section(".ksvar"); =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 \ >> + =C2=A0 =C2=A0 =C2=A0 static struct ksvar_set name ## _ksvar_set =3D { = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0\ >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 .idx =3D index, =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 \ >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 .pksvar =3D (char *) = &name =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0\ >> + =C2=A0 =C2=A0 =C2=A0 }; =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0\ >> + =C2=A0 =C2=A0 =C2=A0 DATA_SET(ksvar_set, name ## _ksvar_set) >> >> Every variable must have a unique index. The indexes must >> start from zero and be consecutive. When you add an index >> you must bump the size of the table (KSVAR_TABLE_SIZE) >> (see sys/sys/ksvar.h) >> >> The variables are inside the kernel static image that isn't managed >> by the VM and so we need to allocate pages to map the physical addresses= . >> A new SYSINIT (ksvarinit) will allocate a set of vm_page_t =C2=A0through >> the vm_phys_fictitious_reg_range interface and fill the table using >> the information >> of the ksvar_set linker set, then will create a vm_object_t (vm_object_k= svar), >> mark the fake pages as valid and put them into it. >> When a new process is created by exec(3) the vm_object_ksvar will be >> mapped read-only into the process address space by vm_map_fixed routine >> just before mapping the user stack. The address of mapping will be recor= ded >> inside the new p_ksvar field of the struct proc. >> This field will be exported through a sysctl to the user space processes= . >> In order to implement syscalls as user space routines, we have to find o= ut the >> mapped address of the kernel shared variables when the libc is mapped in= to >> the process. So I added a function marked with the attribute constructor= . >> It will called before any code into user process and before any code ins= ide >> the libc. >> >> +__attribute((constructor)) void init_kernel_shared() >> +{ >> + =C2=A0 =C2=A0 =C2=A0 int mib[2]; >> + =C2=A0 =C2=A0 =C2=A0 size_t len; >> + =C2=A0 =C2=A0 =C2=A0 vm_offset_t ksvar_address; >> + >> + =C2=A0 =C2=A0 =C2=A0 mib[0] =3D CTL_KERN; >> + =C2=A0 =C2=A0 =C2=A0 mib[1] =3D KERN_KSVAR; >> + =C2=A0 =C2=A0 =C2=A0 len =3D sizeof(vm_offset_t); >> + =C2=A0 =C2=A0 =C2=A0 if (__sysctl(mib, 2, (void *) &ksvar_address, &le= n, NULL, 0) !=3D -1) >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 ksvar_table =3D (uint= 32_t *) ksvar_address; >> +} >> >> Once the libc knows the address of the table it can access to the shared >> variables. >> >> Just as proof of concept I re-implemented gettimeofday(3) in user space. >> First of all I didn't remove the entry into the syscall.master, just ren= amed the >> sys_gettimeofday. I need it for the fallback path. >> In the kernel I introduced a struct wall_clock. >> >> +struct wall_clock >> +{ >> + =C2=A0 =C2=A0 =C2=A0 struct timeval =C2=A0tv; >> + =C2=A0 =C2=A0 =C2=A0 struct timezone tz; >> +}; >> >> The struct is exported through sys/sys/time.h header. >> I defined a new kernel shared variable. To do so I added an index in >> sys/sys/ksvar.h >> WALL_CLOCK_INDEX and bumped KSVAR_TABLE_SIZE to 1. >> In the sys/kern/kern_clocksource.c >> >> +/* kernel shared variable for implmenting gettimeofday. */ >> +KSVAR_DEFINE(WALL_CLOCK_INDEX, struct wall_clock, wall_clock); >> >> Now we defined a shared variable at index WALL_CLOCK_INDEX of type >> struct wall_clock and named wall_clock. >> Inside handleevents I update the info exported by wall_clock. >> >> + =C2=A0 =C2=A0 =C2=A0 struct timeval tv; >> + >> + =C2=A0 =C2=A0 =C2=A0 /* update time for userspace gettimeofday */ >> + =C2=A0 =C2=A0 =C2=A0 microtime(&tv); >> + =C2=A0 =C2=A0 =C2=A0 wall_clock.tv =3D tv; >> + =C2=A0 =C2=A0 =C2=A0 wall_clock.tz.tz_minuteswest =3D tz_minuteswest; >> + =C2=A0 =C2=A0 =C2=A0 wall_clock.tz.tz_dsttime =3D tz_dsttime; >> >> Now, in libc we import the shared variable >> >> +KSVAR_IMPORT(WALL_CLOCK_INDEX, struct wall_clock, wall_clock); >> >> note that WALL_CLOCK_INDEX must be the same of the one defined >> inside the kernel, and define a new function gettimeofday >> >> +int >> +gettimeofday(struct timeval *tp, struct timezone *tzp) >> +{ >> + >> + =C2=A0 =C2=A0 =C2=A0 /* fallback to syscall if kernel doesn't export k= svar */ >> + =C2=A0 =C2=A0 =C2=A0 if (!KSVAR_IS_ACTIVE()) >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 return (sys_gettimeof= day(tp, tzp)); >> + >> + =C2=A0 =C2=A0 =C2=A0 if (tp !=3D NULL) >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 *tp =3D KSVAR(wall_cl= ock)->tv; >> + =C2=A0 =C2=A0 =C2=A0 if (tzp !=3D NULL) >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 *tzp =3D KSVAR(wall_c= lock)->tz; >> + =C2=A0 =C2=A0 =C2=A0 return (0); >> +} >> >> Now when a process will call getimeofday, will call that function actual= ly. >> If the process makes a lot of call to gettimeofday, we will see a >> performance boost. >> Note that if ksvar are not exported from the kernel (KSVAR_IS_ACTIVE), >> the function >> fallback to call the actual syscall (sys_gettimeofday). >> >> Open tasks >> - implement support for 32-bit emulated processes running in a 64-bit >> environment. >> - extend support to others arch >> - implement more syscalls >> - benchmarks >> - Test, test, test. >> >> I'm looking forward to hear about your comments and suggestions. > > I very much dislike what you described, it makes ABI maintanence > a nightmare. > Below is some mail I wrote around Spring 2009, making some notes about > desired proposal. This is what called vdso in Linux land. Did you bother to read at least Giovanni's description? Because this has nothing to do with VDSO in Linux. I think, he just wants to map in userland processes some pages from the static image of the kernel (packed together in a specific dataset). This imposes some non-trivial problem. The first thing is that the static image is not thought to have physical pages tied to it. The second is that he needs to make a clean design in order to let consumer of this mechanism to correctly locate informations they want within the shared page(s) and in the end read the correct values. I have some reservations on both the implementation and the approach for retrieving datas from the page. In particular, I don't like that a new vm_object is allocated for this page. What I really would like would be: 1) very minimal implementation -- you just use pmap_enter()/pmap_remove() specifically when needed, separately, in fork(), execve(), etc. cases 2) more complete approach -- you make a very quick layer which let you map pages from the static image of the kernel and the shared page becomes just a specific consumer of this. This way the object has much more sense because it becomes an object associated to all the static image of the kernel About the layering, I don't like that you require both a kernel and userland header to locate the objects within the page. This is very likely ABI breakage prone. It is needed a mechanism for retrieving at run time what Giovanni calls "indexes", or making it indexes-agnostic. Attilio --=20 Peace can only be achieved by understanding - A. Einstein From owner-freebsd-arch@FreeBSD.ORG Sat Jun 2 16:49:09 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E34B01065676; Sat, 2 Jun 2012 16:49:09 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id 1F3BE8FC0A; Sat, 2 Jun 2012 16:49:07 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q52GmlRo059366; Sat, 2 Jun 2012 19:48:47 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q52GmlPU074939; Sat, 2 Jun 2012 19:48:47 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q52GmlkT074938; Sat, 2 Jun 2012 19:48:47 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Sat, 2 Jun 2012 19:48:47 +0300 From: Konstantin Belousov To: Attilio Rao Message-ID: <20120602164847.GB2358@deviant.kiev.zoral.com.ua> References: <20120601193522.GA2358@deviant.kiev.zoral.com.ua> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="gQmjQz8lQ7hwrZL9" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: alc@freebsd.org, Alexander Kabaev , Giovanni Trematerra , freebsd-arch@freebsd.org Subject: Re: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Jun 2012 16:49:10 -0000 --gQmjQz8lQ7hwrZL9 Content-Type: text/plain; charset=koi8-r Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sat, Jun 02, 2012 at 02:01:35PM +0100, Attilio Rao wrote: > 2012/6/1 Konstantin Belousov : > > On Fri, Jun 01, 2012 at 07:53:15PM +0200, Giovanni Trematerra wrote: > >> Hello, > >> I'd like to discuss a way to provide a mechanism to share some read-on= ly > >> data between kernel and user space programs avoiding syscall overhead, > >> implementing some them, such as gettimeofday(3) and time(3) as ordinary > >> user space routine. > >> > >> The patch at > >> http://www.trematerra.net/patches/ksvar_experimental.patch > >> > >> is in a very experimental stage. It's just a proof-of-concept. > >> Only works for an AMD64 kernel and only for 64-bit applications. > >> The idea is to have all the variables that we want to share between ke= rnel > >> and user space into one or more consecutive pages of memory that will = be > >> mapped read-only into every running process. At the start of the first > >> shared page > >> there'll be a table with as many entries as the number of the shared v= ariables. > >> Each entry is a 32-bit value that is the offset between the start of t= he shared > >> page and the start of the variable in the page. The user space process= es need > >> to find out the map address of shared page and use the table to access= to the > >> shared variables. > >> Kernel will export a variable to user space as an index, so user space= code > >> must refer to a specific index to access a kernel shared variable. > >> Let's take a quick look to the KPI/API for exporting/importing kernel > >> shared variables. > >> Say we want implement a routine to export an int from the kernel. > >> To define the variable to be exported inside the kernel you would use > >> > >> KSVAR_DEFINE(0, int, test_value); > >> > >> You have just defined an int variable named "test_value" at index 0. > >> Inside the kernel you can write/read as usual using the symbol test_va= lue; > >> Now you likely want add to libc a function callable from user processes > >> that return the test_value variable. So first of all you need the impo= rt the > >> variable. > >> > >> KSVAR_IMPORT(0, int, test_value); > >> > >> and to obtain a pointer to read the value you would use > >> > >> KSVAR(test_value); > >> > >> so your function would look like something like this > >> > >> int get_test_value() > >> { > >> > >> =9A =9A =9Areturn (*KSVAR(test_value)); > >> } > >> > >> Then inside your process just call get_test_value() function as you us= ually > >> do and you'll get a kernel written value without switching in kernel m= ode. > >> > >> Let's see now in more detail how that could be accomplished. > >> The shared variables will be accessed as normal variables and are read= /write > >> inside the kernel. The variables need to be inside the same page(s) an= d nothing > >> but the shared variables (and the table) must be into the page(s). To > >> obtain that > >> I changed the linker script in this way > >> > >> --- a/sys/conf/ldscript.amd64 > >> +++ b/sys/conf/ldscript.amd64 > >> @@ -177,6 +177,15 @@ SECTIONS > >> =9A =9A *(.ldata .ldata.* .gnu.linkonce.l.*) > >> =9A =9A . =3D ALIGN(. !=3D 0 ? 64 / 8 : 1); > >> =9A } > >> + =9A.ksvar ALIGN(CONSTANT (COMMONPAGESIZE)) : > >> + =9A{ > >> + =9A =9A__ksvar_set_start =3D .; > >> + =9A =9A*(.ksvar_table) > >> + =9A =9A*(.ksvar) > >> + > >> + =9A . =3D ALIGN(CONSTANT (COMMONPAGESIZE)); > >> + =9A __ksvar_set_stop =3D .; > >> + =9A} > >> =9A . =3D ALIGN(64 / 8); > >> =9A _end =3D .; PROVIDE (end =3D .); > >> =9A . =3D DATA_SEGMENT_END (.); > >> > >> When we want to define a variable in the kernel to share with user spa= ce > >> we have to use KSVAR_DEFINE macro in sys/sys/ksvar.h > >> > >> +struct ksvar_set { > >> + =9A =9A =9A uint32_t idx; > >> + =9A =9A =9A char *pksvar; > >> +}; > >> + > >> +/* > >> + * Declare a variable into kernel shared linker_set. > >> + */ > >> +#define =9A =9A =9A =9AKSVAR_DEFINE(index, type, name) \ > >> + =9A =9A =9A static type name __section(".ksvar"); =9A =9A =9A =9A = =9A =9A =9A =9A =9A \ > >> + =9A =9A =9A static struct ksvar_set name ## _ksvar_set =3D { =9A =9A= =9A =9A =9A\ > >> + =9A =9A =9A =9A =9A =9A =9A .idx =3D index, =9A =9A =9A =9A =9A =9A = =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A \ > >> + =9A =9A =9A =9A =9A =9A =9A .pksvar =3D (char *) &name =9A =9A =9A = =9A =9A =9A =9A =9A =9A =9A =9A =9A\ > >> + =9A =9A =9A }; =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A = =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A =9A\ > >> + =9A =9A =9A DATA_SET(ksvar_set, name ## _ksvar_set) > >> > >> Every variable must have a unique index. The indexes must > >> start from zero and be consecutive. When you add an index > >> you must bump the size of the table (KSVAR_TABLE_SIZE) > >> (see sys/sys/ksvar.h) > >> > >> The variables are inside the kernel static image that isn't managed > >> by the VM and so we need to allocate pages to map the physical address= es. > >> A new SYSINIT (ksvarinit) will allocate a set of vm_page_t =9Athrough > >> the vm_phys_fictitious_reg_range interface and fill the table using > >> the information > >> of the ksvar_set linker set, then will create a vm_object_t (vm_object= _ksvar), > >> mark the fake pages as valid and put them into it. > >> When a new process is created by exec(3) the vm_object_ksvar will be > >> mapped read-only into the process address space by vm_map_fixed routine > >> just before mapping the user stack. The address of mapping will be rec= orded > >> inside the new p_ksvar field of the struct proc. > >> This field will be exported through a sysctl to the user space process= es. > >> In order to implement syscalls as user space routines, we have to find= out the > >> mapped address of the kernel shared variables when the libc is mapped = into > >> the process. So I added a function marked with the attribute construct= or. > >> It will called before any code into user process and before any code i= nside > >> the libc. > >> > >> +__attribute((constructor)) void init_kernel_shared() > >> +{ > >> + =9A =9A =9A int mib[2]; > >> + =9A =9A =9A size_t len; > >> + =9A =9A =9A vm_offset_t ksvar_address; > >> + > >> + =9A =9A =9A mib[0] =3D CTL_KERN; > >> + =9A =9A =9A mib[1] =3D KERN_KSVAR; > >> + =9A =9A =9A len =3D sizeof(vm_offset_t); > >> + =9A =9A =9A if (__sysctl(mib, 2, (void *) &ksvar_address, &len, NULL= , 0) !=3D -1) > >> + =9A =9A =9A =9A =9A =9A =9A ksvar_table =3D (uint32_t *) ksvar_addre= ss; > >> +} > >> > >> Once the libc knows the address of the table it can access to the shar= ed > >> variables. > >> > >> Just as proof of concept I re-implemented gettimeofday(3) in user spac= e. > >> First of all I didn't remove the entry into the syscall.master, just r= enamed the > >> sys_gettimeofday. I need it for the fallback path. > >> In the kernel I introduced a struct wall_clock. > >> > >> +struct wall_clock > >> +{ > >> + =9A =9A =9A struct timeval =9Atv; > >> + =9A =9A =9A struct timezone tz; > >> +}; > >> > >> The struct is exported through sys/sys/time.h header. > >> I defined a new kernel shared variable. To do so I added an index in > >> sys/sys/ksvar.h > >> WALL_CLOCK_INDEX and bumped KSVAR_TABLE_SIZE to 1. > >> In the sys/kern/kern_clocksource.c > >> > >> +/* kernel shared variable for implmenting gettimeofday. */ > >> +KSVAR_DEFINE(WALL_CLOCK_INDEX, struct wall_clock, wall_clock); > >> > >> Now we defined a shared variable at index WALL_CLOCK_INDEX of type > >> struct wall_clock and named wall_clock. > >> Inside handleevents I update the info exported by wall_clock. > >> > >> + =9A =9A =9A struct timeval tv; > >> + > >> + =9A =9A =9A /* update time for userspace gettimeofday */ > >> + =9A =9A =9A microtime(&tv); > >> + =9A =9A =9A wall_clock.tv =3D tv; > >> + =9A =9A =9A wall_clock.tz.tz_minuteswest =3D tz_minuteswest; > >> + =9A =9A =9A wall_clock.tz.tz_dsttime =3D tz_dsttime; > >> > >> Now, in libc we import the shared variable > >> > >> +KSVAR_IMPORT(WALL_CLOCK_INDEX, struct wall_clock, wall_clock); > >> > >> note that WALL_CLOCK_INDEX must be the same of the one defined > >> inside the kernel, and define a new function gettimeofday > >> > >> +int > >> +gettimeofday(struct timeval *tp, struct timezone *tzp) > >> +{ > >> + > >> + =9A =9A =9A /* fallback to syscall if kernel doesn't export ksvar */ > >> + =9A =9A =9A if (!KSVAR_IS_ACTIVE()) > >> + =9A =9A =9A =9A =9A =9A =9A return (sys_gettimeofday(tp, tzp)); > >> + > >> + =9A =9A =9A if (tp !=3D NULL) > >> + =9A =9A =9A =9A =9A =9A =9A *tp =3D KSVAR(wall_clock)->tv; > >> + =9A =9A =9A if (tzp !=3D NULL) > >> + =9A =9A =9A =9A =9A =9A =9A *tzp =3D KSVAR(wall_clock)->tz; > >> + =9A =9A =9A return (0); > >> +} > >> > >> Now when a process will call getimeofday, will call that function actu= ally. > >> If the process makes a lot of call to gettimeofday, we will see a > >> performance boost. > >> Note that if ksvar are not exported from the kernel (KSVAR_IS_ACTIVE), > >> the function > >> fallback to call the actual syscall (sys_gettimeofday). > >> > >> Open tasks > >> - implement support for 32-bit emulated processes running in a 64-bit > >> environment. > >> - extend support to others arch > >> - implement more syscalls > >> - benchmarks > >> - Test, test, test. > >> > >> I'm looking forward to hear about your comments and suggestions. > > > > I very much dislike what you described, it makes ABI maintanence > > a nightmare. > > Below is some mail I wrote around Spring 2009, making some notes about > > desired proposal. This is what called vdso in Linux land. >=20 > Did you bother to read at least Giovanni's description? > Because this has nothing to do with VDSO in Linux. Did you bothered to think shortly why do I object ? >=20 > I think, he just wants to map in userland processes some pages from > the static image of the kernel (packed together in a specific > dataset). This imposes some non-trivial problem. The first thing is > that the static image is not thought to have physical pages tied to > it. The second is that he needs to make a clean design in order to let > consumer of this mechanism to correctly locate informations they want > within the shared page(s) and in the end read the correct values. Right, exactly, and this is why I object to the "offsets" approach. It basically moves us to the old times of the "jump tables" shared libraries, that fortunately was never a case for FreeBSD even when a.out was used. >=20 > I have some reservations on both the implementation and the approach > for retrieving datas from the page. > In particular, I don't like that a new vm_object is allocated for this > page. What I really would like would be: > 1) very minimal implementation -- you just use > pmap_enter()/pmap_remove() specifically when needed, separately, in > fork(), execve(), etc. cases Oh, this simply cannot work. > 2) more complete approach -- you make a very quick layer which let you > map pages from the static image of the kernel and the shared page > becomes just a specific consumer of this. This way the object has much > more sense because it becomes an object associated to all the static > image of the kernel So you want to circumvent the vm layer. >=20 > About the layering, I don't like that you require both a kernel and > userland header to locate the objects within the page. This is very > likely ABI breakage prone. It is needed a mechanism for retrieving at > run time what Giovanni calls "indexes", or making it indexes-agnostic. And this is what VDSO is for. VDSO with the standard ELF symbol interposition rules allow to have libc that is completely unaware of the shared page and 'indexes', i.e. which works both for older kernel that do not export required index, and for new kernels that export the same information in some more advanced format. By having VDSO that exports e.g. gettimeofday() we would get override for libc gettimeofday, while having fully functional libc for other, future and past, kernels, even if the format of the data exported for super-fast gettimeofday changes. The tight between VDSO and kernel is not a problem, since VDSO is part of the kernel from the deployment POV. More. either existing ELF linker in kernel, or some trivial modifications to it, would allow to not use 'indexes' on the kernel side too. We already have a shared page between kernel and whole set of the same-ABI processes. Currently it is used for signal trampolines only. The hard parts of the task is to provide VDSO build glue. Also IMO the hard task is to define sensible gettimeofday() implementation, probably using rdtsc in usermode. Shared page is easy, or at least it is already there without ugly and non-working vm hacks. As an additional note, already put by Bruce, the implementation of usermode gettimeofday is exactly opposite of any reasonable implementation. It looses the precision to the frequency of the event timer. Obvious approach is to not have any periodically updating data for gettimeofday purpose, and use some formula with rdtsc and kernel-provided coefficients on the machines where rdtsc is usable. Interesting question is how much shared the shared page needs be. Obvious needs are shared between all same-ABI processes, but I can also easily see a need for the per-process private information be present in the 'private-shared' page. For silly but typical example, useful for moronix-style benchmarks, see getpid(). Shrug. --gQmjQz8lQ7hwrZL9 Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk/KQ+8ACgkQC3+MBN1Mb4g2NACgkLX/iLA3GzLGxP81Orzy+X7G GVEAoIuyoHDauMOErYp+wNLxNWZp5vBF =gT67 -----END PGP SIGNATURE----- --gQmjQz8lQ7hwrZL9-- From owner-freebsd-arch@FreeBSD.ORG Sat Jun 2 17:00:09 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E67B4106564A; Sat, 2 Jun 2012 17:00:08 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from mail-lpp01m010-f54.google.com (mail-lpp01m010-f54.google.com [209.85.215.54]) by mx1.freebsd.org (Postfix) with ESMTP id AE3298FC18; Sat, 2 Jun 2012 17:00:07 +0000 (UTC) Received: by laai10 with SMTP id i10so2823846laa.13 for ; Sat, 02 Jun 2012 10:00:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:content-type :content-transfer-encoding; bh=kTgNyiKz8flKgtATHJ/9fmR9G2Pd/ZSOjw0m3AUVvQQ=; b=zdzkBwlKOmni4ta3rf9Y6A3FqUJBlyBreqeKnCKGsogRVSZ9uamJ07LPh5h9YwylTQ fFcy+UG7DghtNLzajGSaoBuPnjU25XcBS+iue8qaKXkSOaZ8f20uDivBI1olm02aNxmf J7eKIUiE5czQbqfOloaCMl2PS5TKogg+m9t911LOkNWK8KRUBBTreDx1x7e/Ysp87Oqi s4TH0FT357S42AEbvkgg8jP+CJAknCfRSjH6ccDc7JmBx/Sxpl4y0CCttWpl3czLhM+9 ZbEncDquAacDefbjlvyK5hsafbIpFuGwdRfVKbpbCewtyyeZT5QJkJ7cqgbiT7cwxQy8 gF7w== MIME-Version: 1.0 Received: by 10.112.45.4 with SMTP id i4mr3701338lbm.79.1338656406504; Sat, 02 Jun 2012 10:00:06 -0700 (PDT) Sender: asmrookie@gmail.com Received: by 10.112.27.65 with HTTP; Sat, 2 Jun 2012 10:00:06 -0700 (PDT) In-Reply-To: References: <20120601193522.GA2358@deviant.kiev.zoral.com.ua> <20120602164847.GB2358@deviant.kiev.zoral.com.ua> Date: Sat, 2 Jun 2012 18:00:06 +0100 X-Google-Sender-Auth: sMCBm15RYm0QSB4Z016r5COjoKo Message-ID: From: Attilio Rao To: freebsd-arch@freebsd.org, Gianni , Alexander Kabaev , Alan Cox , Konstantin Belousov Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Cc: Subject: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Jun 2012 17:00:09 -0000 Sorry, resending with all the recipients in. Attilio ---------- Forwarded message ---------- From: Attilio Rao Date: 2012/6/2 Subject: Re: [RFC] Kernel shared variables To: Konstantin Belousov 2012/6/2 Konstantin Belousov : > On Sat, Jun 02, 2012 at 02:01:35PM +0100, Attilio Rao wrote: >> 2012/6/1 Konstantin Belousov : >> > On Fri, Jun 01, 2012 at 07:53:15PM +0200, Giovanni Trematerra wrote: >> >> Hello, >> >> I'd like to discuss a way to provide a mechanism to share some read-o= nly >> >> data between kernel and user space programs avoiding syscall overhead= , >> >> implementing some them, such as gettimeofday(3) and time(3) as ordina= ry >> >> user space routine. >> >> >> >> The patch at >> >> http://www.trematerra.net/patches/ksvar_experimental.patch >> >> >> >> is in a very experimental stage. It's just a proof-of-concept. >> >> Only works for an AMD64 kernel and only for 64-bit applications. >> >> The idea is to have all the variables that we want to share between k= ernel >> >> and user space into one or more consecutive pages of memory that will= be >> >> mapped read-only into every running process. At the start of the firs= t >> >> shared page >> >> there'll be a table with as many entries as the number of the shared = variables. >> >> Each entry is a 32-bit value that is the offset between the start of = the shared >> >> page and the start of the variable in the page. The user space proces= ses need >> >> to find out the map address of shared page and use the table to acces= s to the >> >> shared variables. >> >> Kernel will export a variable to user space as an index, so user spac= e code >> >> must refer to a specific index to access a kernel shared variable. >> >> Let's take a quick look to the KPI/API for exporting/importing kernel >> >> shared variables. >> >> Say we want implement a routine to export an int from the kernel. >> >> To define the variable to be exported inside the kernel you would use >> >> >> >> KSVAR_DEFINE(0, int, test_value); >> >> >> >> You have just defined an int variable named "test_value" at index 0. >> >> Inside the kernel you can write/read as usual using the symbol test_v= alue; >> >> Now you likely want add to libc a function callable from user process= es >> >> that return the test_value variable. So first of all you need the imp= ort the >> >> variable. >> >> >> >> KSVAR_IMPORT(0, int, test_value); >> >> >> >> and to obtain a pointer to read the value you would use >> >> >> >> KSVAR(test_value); >> >> >> >> so your function would look like something like this >> >> >> >> int get_test_value() >> >> { >> >> >> >> =C2=A0 =C2=A0 =C2=A0return (*KSVAR(test_value)); >> >> } >> >> >> >> Then inside your process just call get_test_value() function as you u= sually >> >> do and you'll get a kernel written value without switching in kernel = mode. >> >> >> >> Let's see now in more detail how that could be accomplished. >> >> The shared variables will be accessed as normal variables and are rea= d/write >> >> inside the kernel. The variables need to be inside the same page(s) a= nd nothing >> >> but the shared variables (and the table) must be into the page(s). To >> >> obtain that >> >> I changed the linker script in this way >> >> >> >> --- a/sys/conf/ldscript.amd64 >> >> +++ b/sys/conf/ldscript.amd64 >> >> @@ -177,6 +177,15 @@ SECTIONS >> >> =C2=A0 =C2=A0 *(.ldata .ldata.* .gnu.linkonce.l.*) >> >> =C2=A0 =C2=A0 . =3D ALIGN(. !=3D 0 ? 64 / 8 : 1); >> >> =C2=A0 } >> >> + =C2=A0.ksvar ALIGN(CONSTANT (COMMONPAGESIZE)) : >> >> + =C2=A0{ >> >> + =C2=A0 =C2=A0__ksvar_set_start =3D .; >> >> + =C2=A0 =C2=A0*(.ksvar_table) >> >> + =C2=A0 =C2=A0*(.ksvar) >> >> + >> >> + =C2=A0 . =3D ALIGN(CONSTANT (COMMONPAGESIZE)); >> >> + =C2=A0 __ksvar_set_stop =3D .; >> >> + =C2=A0} >> >> =C2=A0 . =3D ALIGN(64 / 8); >> >> =C2=A0 _end =3D .; PROVIDE (end =3D .); >> >> =C2=A0 . =3D DATA_SEGMENT_END (.); >> >> >> >> When we want to define a variable in the kernel to share with user sp= ace >> >> we have to use KSVAR_DEFINE macro in sys/sys/ksvar.h >> >> >> >> +struct ksvar_set { >> >> + =C2=A0 =C2=A0 =C2=A0 uint32_t idx; >> >> + =C2=A0 =C2=A0 =C2=A0 char *pksvar; >> >> +}; >> >> + >> >> +/* >> >> + * Declare a variable into kernel shared linker_set. >> >> + */ >> >> +#define =C2=A0 =C2=A0 =C2=A0 =C2=A0KSVAR_DEFINE(index, type, name) \ >> >> + =C2=A0 =C2=A0 =C2=A0 static type name __section(".ksvar"); =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 \ >> >> + =C2=A0 =C2=A0 =C2=A0 static struct ksvar_set name ## _ksvar_set =3D= { =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0\ >> >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 .idx =3D index, = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 \ >> >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 .pksvar =3D (char = *) &name =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0\ >> >> + =C2=A0 =C2=A0 =C2=A0 }; =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0\ >> >> + =C2=A0 =C2=A0 =C2=A0 DATA_SET(ksvar_set, name ## _ksvar_set) >> >> >> >> Every variable must have a unique index. The indexes must >> >> start from zero and be consecutive. When you add an index >> >> you must bump the size of the table (KSVAR_TABLE_SIZE) >> >> (see sys/sys/ksvar.h) >> >> >> >> The variables are inside the kernel static image that isn't managed >> >> by the VM and so we need to allocate pages to map the physical addres= ses. >> >> A new SYSINIT (ksvarinit) will allocate a set of vm_page_t =C2=A0thro= ugh >> >> the vm_phys_fictitious_reg_range interface and fill the table using >> >> the information >> >> of the ksvar_set linker set, then will create a vm_object_t (vm_objec= t_ksvar), >> >> mark the fake pages as valid and put them into it. >> >> When a new process is created by exec(3) the vm_object_ksvar will be >> >> mapped read-only into the process address space by vm_map_fixed routi= ne >> >> just before mapping the user stack. The address of mapping will be re= corded >> >> inside the new p_ksvar field of the struct proc. >> >> This field will be exported through a sysctl to the user space proces= ses. >> >> In order to implement syscalls as user space routines, we have to fin= d out the >> >> mapped address of the kernel shared variables when the libc is mapped= into >> >> the process. So I added a function marked with the attribute construc= tor. >> >> It will called before any code into user process and before any code = inside >> >> the libc. >> >> >> >> +__attribute((constructor)) void init_kernel_shared() >> >> +{ >> >> + =C2=A0 =C2=A0 =C2=A0 int mib[2]; >> >> + =C2=A0 =C2=A0 =C2=A0 size_t len; >> >> + =C2=A0 =C2=A0 =C2=A0 vm_offset_t ksvar_address; >> >> + >> >> + =C2=A0 =C2=A0 =C2=A0 mib[0] =3D CTL_KERN; >> >> + =C2=A0 =C2=A0 =C2=A0 mib[1] =3D KERN_KSVAR; >> >> + =C2=A0 =C2=A0 =C2=A0 len =3D sizeof(vm_offset_t); >> >> + =C2=A0 =C2=A0 =C2=A0 if (__sysctl(mib, 2, (void *) &ksvar_address, = &len, NULL, 0) !=3D -1) >> >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 ksvar_table =3D (u= int32_t *) ksvar_address; >> >> +} >> >> >> >> Once the libc knows the address of the table it can access to the sha= red >> >> variables. >> >> >> >> Just as proof of concept I re-implemented gettimeofday(3) in user spa= ce. >> >> First of all I didn't remove the entry into the syscall.master, just = renamed the >> >> sys_gettimeofday. I need it for the fallback path. >> >> In the kernel I introduced a struct wall_clock. >> >> >> >> +struct wall_clock >> >> +{ >> >> + =C2=A0 =C2=A0 =C2=A0 struct timeval =C2=A0tv; >> >> + =C2=A0 =C2=A0 =C2=A0 struct timezone tz; >> >> +}; >> >> >> >> The struct is exported through sys/sys/time.h header. >> >> I defined a new kernel shared variable. To do so I added an index in >> >> sys/sys/ksvar.h >> >> WALL_CLOCK_INDEX and bumped KSVAR_TABLE_SIZE to 1. >> >> In the sys/kern/kern_clocksource.c >> >> >> >> +/* kernel shared variable for implmenting gettimeofday. */ >> >> +KSVAR_DEFINE(WALL_CLOCK_INDEX, struct wall_clock, wall_clock); >> >> >> >> Now we defined a shared variable at index WALL_CLOCK_INDEX of type >> >> struct wall_clock and named wall_clock. >> >> Inside handleevents I update the info exported by wall_clock. >> >> >> >> + =C2=A0 =C2=A0 =C2=A0 struct timeval tv; >> >> + >> >> + =C2=A0 =C2=A0 =C2=A0 /* update time for userspace gettimeofday */ >> >> + =C2=A0 =C2=A0 =C2=A0 microtime(&tv); >> >> + =C2=A0 =C2=A0 =C2=A0 wall_clock.tv =3D tv; >> >> + =C2=A0 =C2=A0 =C2=A0 wall_clock.tz.tz_minuteswest =3D tz_minuteswes= t; >> >> + =C2=A0 =C2=A0 =C2=A0 wall_clock.tz.tz_dsttime =3D tz_dsttime; >> >> >> >> Now, in libc we import the shared variable >> >> >> >> +KSVAR_IMPORT(WALL_CLOCK_INDEX, struct wall_clock, wall_clock); >> >> >> >> note that WALL_CLOCK_INDEX must be the same of the one defined >> >> inside the kernel, and define a new function gettimeofday >> >> >> >> +int >> >> +gettimeofday(struct timeval *tp, struct timezone *tzp) >> >> +{ >> >> + >> >> + =C2=A0 =C2=A0 =C2=A0 /* fallback to syscall if kernel doesn't expor= t ksvar */ >> >> + =C2=A0 =C2=A0 =C2=A0 if (!KSVAR_IS_ACTIVE()) >> >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 return (sys_gettim= eofday(tp, tzp)); >> >> + >> >> + =C2=A0 =C2=A0 =C2=A0 if (tp !=3D NULL) >> >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 *tp =3D KSVAR(wall= _clock)->tv; >> >> + =C2=A0 =C2=A0 =C2=A0 if (tzp !=3D NULL) >> >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 *tzp =3D KSVAR(wal= l_clock)->tz; >> >> + =C2=A0 =C2=A0 =C2=A0 return (0); >> >> +} >> >> >> >> Now when a process will call getimeofday, will call that function act= ually. >> >> If the process makes a lot of call to gettimeofday, we will see a >> >> performance boost. >> >> Note that if ksvar are not exported from the kernel (KSVAR_IS_ACTIVE)= , >> >> the function >> >> fallback to call the actual syscall (sys_gettimeofday). >> >> >> >> Open tasks >> >> - implement support for 32-bit emulated processes running in a 64-bit >> >> environment. >> >> - extend support to others arch >> >> - implement more syscalls >> >> - benchmarks >> >> - Test, test, test. >> >> >> >> I'm looking forward to hear about your comments and suggestions. >> > >> > I very much dislike what you described, it makes ABI maintanence >> > a nightmare. >> > Below is some mail I wrote around Spring 2009, making some notes about >> > desired proposal. This is what called vdso in Linux land. >> >> Did you bother to read at least Giovanni's description? >> Because this has nothing to do with VDSO in Linux. > Did you bothered to think shortly why do I object ? > >> >> I think, he just wants to map in userland processes some pages from >> the static image of the kernel (packed together in a specific >> dataset). This imposes some non-trivial problem. The first thing is >> that the static image is not thought to have physical pages tied to >> it. The second is that he needs to make a clean design in order to let >> consumer of this mechanism to correctly locate informations they want >> within the shared page(s) and in the end read the correct values. > Right, exactly, and this is why I object to the "offsets" approach. > It basically moves us to the old times of the "jump tables" shared > libraries, that fortunately was never a case for FreeBSD even when > a.out was used. I'm objecting to this either. >> >> I have some reservations on both the implementation and the approach >> for retrieving datas from the page. >> In particular, I don't like that a new vm_object is allocated for this >> page. What I really would like would be: >> 1) very minimal implementation -- you just use >> pmap_enter()/pmap_remove() specifically when needed, separately, in >> fork(), execve(), etc. cases > Oh, this simply cannot work. And why? Assuming you provide a vm_page_t from an UMA zone just like fakepage do. Of course you cannot recycle for this purpose any page caming from vm_page_alloc(). >> 2) more complete approach -- you make a very quick layer which let you >> map pages from the static image of the kernel and the shared page >> becomes just a specific consumer of this. This way the object has much >> more sense because it becomes an object associated to all the static >> image of the kernel > So you want to circumvent the vm layer. Note sure I agree with your opinion on this. >> >> About the layering, I don't like that you require both a kernel and >> userland header to locate the objects within the page. This is very >> likely ABI breakage prone. It is needed a mechanism for retrieving at >> run time what Giovanni calls "indexes", or making it indexes-agnostic. > > And this is what VDSO is for. VDSO with the standard ELF symbol > interposition rules allow to have libc that is completely unaware of the > shared page and 'indexes', i.e. which works both for older kernel that > do not export required index, and for new kernels that export the same > information in some more advanced format. By having VDSO that exports > e.g. gettimeofday() we would get override for libc gettimeofday, while > having fully functional libc for other, future and past, kernels, even > if the format of the data exported for super-fast gettimeofday changes. > > The tight between VDSO and kernel is not a problem, since VDSO is part > of the kernel from the deployment POV. More. either existing ELF > linker in kernel, or some trivial modifications to it, would allow > to not use 'indexes' on the kernel side too. I admit I don't have a better plan on how to retrieve objects from the shared page at the moment, I didn't give much thought to it. > We already have a shared page between kernel and whole set of the same-AB= I > processes. Currently it is used for signal trampolines only. > The hard parts of the task is to provide VDSO build glue. Also IMO the > hard task is to define sensible gettimeofday() implementation, probably > using rdtsc in usermode. Shared page is easy, or at least it is already > there without ugly and non-working vm hacks. > > As an additional note, already put by Bruce, the implementation of > usermode gettimeofday is exactly opposite of any reasonable implementatio= n. > It looses the precision to the frequency of the event timer. Obvious > approach is to not have any periodically updating data for gettimeofday > purpose, and use some formula with rdtsc and kernel-provided coefficients > on the machines where rdtsc is usable. The gettimeofday() implementation is a different story than what is asked h= ere. > Interesting question is how much shared the shared page needs be. > Obvious needs are shared between all same-ABI processes, but I can also > easily see a need for the per-process private information be present in > the 'private-shared' page. For silly but typical example, useful for > moronix-style benchmarks, see getpid(). Really the performance benefits of having fast getpid() is marginal if compared to heavilly used things like gettimeofday(). I cannot think of a per-process page implementing a fast syscall that can bring many perfomance advantages. Attilio -- Peace can only be achieved by understanding - A. Einstein --=20 Peace can only be achieved by understanding - A. Einstein From owner-freebsd-arch@FreeBSD.ORG Sat Jun 2 17:16:37 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9D316106564A; Sat, 2 Jun 2012 17:16:37 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id 083FF8FC1B; Sat, 2 Jun 2012 17:16:36 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q52HGXPx066108; Sat, 2 Jun 2012 20:16:33 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q52HGWLB075163; Sat, 2 Jun 2012 20:16:32 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q52HGWrW075162; Sat, 2 Jun 2012 20:16:32 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Sat, 2 Jun 2012 20:16:32 +0300 From: Konstantin Belousov To: Attilio Rao Message-ID: <20120602171632.GC2358@deviant.kiev.zoral.com.ua> References: <20120601193522.GA2358@deviant.kiev.zoral.com.ua> <20120602164847.GB2358@deviant.kiev.zoral.com.ua> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="0CHKT3anvf6u5QiQ" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: Alexander Kabaev , Alan Cox , Konstantin Belousov , Gianni , freebsd-arch@freebsd.org Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Jun 2012 17:16:37 -0000 --0CHKT3anvf6u5QiQ Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sat, Jun 02, 2012 at 06:00:06PM +0100, Attilio Rao wrote: > Sorry, resending with all the recipients in. >=20 > Attilio >=20 >=20 > ---------- Forwarded message ---------- > From: Attilio Rao > Date: 2012/6/2 > Subject: Re: [RFC] Kernel shared variables > To: Konstantin Belousov >=20 >=20 > 2012/6/2 Konstantin Belousov : > > On Sat, Jun 02, 2012 at 02:01:35PM +0100, Attilio Rao wrote: [Tried to trim the text] > >> I think, he just wants to map in userland processes some pages from > >> the static image of the kernel (packed together in a specific > >> dataset). This imposes some non-trivial problem. The first thing is > >> that the static image is not thought to have physical pages tied to > >> it. The second is that he needs to make a clean design in order to let > >> consumer of this mechanism to correctly locate informations they want > >> within the shared page(s) and in the end read the correct values. > > Right, exactly, and this is why I object to the "offsets" approach. > > It basically moves us to the old times of the "jump tables" shared > > libraries, that fortunately was never a case for FreeBSD even when > > a.out was used. >=20 > I'm objecting to this either. My english is not good enough to understand this. Do you agree or disagree with my statement that 'indexes' make it very hard to maintain ABI ? >=20 > >> > >> I have some reservations on both the implementation and the approach > >> for retrieving datas from the page. > >> In particular, I don't like that a new vm_object is allocated for this > >> page. What I really would like would be: > >> 1) very minimal implementation -- you just use > >> pmap_enter()/pmap_remove() specifically when needed, separately, in > >> fork(), execve(), etc. cases > > Oh, this simply cannot work. >=20 > And why? Assuming you provide a vm_page_t from an UMA zone just like > fakepage do. Of course you cannot recycle for this purpose any page > caming from vm_page_alloc(). Due to pv_collect/pmap_pv_reclaim, the pte might be destroyed any time. Using hacks like mapping the page wired and then needing to hack any VM space manipulation (fork/rfork/exec/exit/swapout/I possibly missed several cases) just does not pay for it. >=20 > >> 2) more complete approach -- you make a very quick layer which let you > >> map pages from the static image of the kernel and the shared page > >> becomes just a specific consumer of this. This way the object has much > >> more sense because it becomes an object associated to all the static > >> image of the kernel > > So you want to circumvent the vm layer. >=20 > Note sure I agree with your opinion on this. >=20 > >> > >> About the layering, I don't like that you require both a kernel and > >> userland header to locate the objects within the page. This is very > >> likely ABI breakage prone. It is needed a mechanism for retrieving at > >> run time what Giovanni calls "indexes", or making it indexes-agnostic. > > > > And this is what VDSO is for. VDSO with the standard ELF symbol > > interposition rules allow to have libc that is completely unaware of the > > shared page and 'indexes', i.e. which works both for older kernel that > > do not export required index, and for new kernels that export the same > > information in some more advanced format. By having VDSO that exports > > e.g. gettimeofday() we would get override for libc gettimeofday, while > > having fully functional libc for other, future and past, kernels, even > > if the format of the data exported for super-fast gettimeofday changes. > > > > The tight between VDSO and kernel is not a problem, since VDSO is part > > of the kernel from the deployment POV. More. either existing ELF > > linker in kernel, or some trivial modifications to it, would allow > > to not use 'indexes' on the kernel side too. >=20 > I admit I don't have a better plan on how to retrieve objects from the > shared page at the moment, I didn't give much thought to it. >=20 > > We already have a shared page between kernel and whole set of the same-= ABI > > processes. Currently it is used for signal trampolines only. > > The hard parts of the task is to provide VDSO build glue. Also IMO the > > hard task is to define sensible gettimeofday() implementation, probably > > using rdtsc in usermode. Shared page is easy, or at least it is already > > there without ugly and non-working vm hacks. > > > > As an additional note, already put by Bruce, the implementation of > > usermode gettimeofday is exactly opposite of any reasonable implementat= ion. > > It looses the precision to the frequency of the event timer. Obvious > > approach is to not have any periodically updating data for gettimeofday > > purpose, and use some formula with rdtsc and kernel-provided coefficien= ts > > on the machines where rdtsc is usable. >=20 > The gettimeofday() implementation is a different story than what is asked= here. But the goal is to have fast clocks, right ? What else is planned ? In fact, I think that if the whole goal is only fast clocks, then we do not need any additional system mechanisms, since we can easily export coefficients for rdtsc formula already. E.g. we can put it into elf auxv, which is ugly but bearable. >=20 > > Interesting question is how much shared the shared page needs be. > > Obvious needs are shared between all same-ABI processes, but I can also > > easily see a need for the per-process private information be present in > > the 'private-shared' page. For silly but typical example, useful for > > moronix-style benchmarks, see getpid(). >=20 > Really the performance benefits of having fast getpid() is marginal if > compared to heavilly used things like gettimeofday(). I cannot think > of a per-process page implementing a fast syscall that can bring many > perfomance advantages. This is completely true, but there may be other process-private data that could benefit from the low access cost. I just do not know right now. --0CHKT3anvf6u5QiQ Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk/KSnAACgkQC3+MBN1Mb4itVACg2xwUF4QRdToJDtqPRvRqaVUT AxwAoIx9JO6bedN2XFgQPWc/EqcAHFvv =sqUF -----END PGP SIGNATURE----- --0CHKT3anvf6u5QiQ-- From owner-freebsd-arch@FreeBSD.ORG Sat Jun 2 17:28:00 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D9FF91065677; Sat, 2 Jun 2012 17:28:00 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from mail-lb0-f182.google.com (mail-lb0-f182.google.com [209.85.217.182]) by mx1.freebsd.org (Postfix) with ESMTP id B96738FC25; Sat, 2 Jun 2012 17:27:59 +0000 (UTC) Received: by lbon10 with SMTP id n10so2985176lbo.13 for ; Sat, 02 Jun 2012 10:27:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=Kt/O+PHHOpLhrbzodt5P7CT7LFTo8FnmSBDFttgcJ+Q=; b=ExKpAs160ZiDxSTWwl+bnQ6GlYzQUZii4nC6ISVvjIPZiUIGkFgx5RYvrZ/47oSwxX BcV6pTut9G8A06AG/px5pq0WaB1CozoIeiHs3B+w88+paZVX4XLt4DoOS3P/LC8t/UZW Ldjj57gN2hd3LkolH4tc13lkqR40JAH7Z1clmo9rGkDKOoDjfj5e2ega496ivOuWsRMI ubdBng/qjibr5I+7qhNTHxp7do8JYstDIrkUJNNPEEvlm7EA/dohuEOKTmtlbVpve6wV GSb4LCs24lC7TVft1T9MLOhH7X0DuIjNpQp4fBpdKk98iT0VmeAXW/Itgh4phR6qpoOa NpGw== MIME-Version: 1.0 Received: by 10.152.131.9 with SMTP id oi9mr6965358lab.39.1338658078575; Sat, 02 Jun 2012 10:27:58 -0700 (PDT) Sender: asmrookie@gmail.com Received: by 10.112.27.65 with HTTP; Sat, 2 Jun 2012 10:27:58 -0700 (PDT) In-Reply-To: <20120602171632.GC2358@deviant.kiev.zoral.com.ua> References: <20120601193522.GA2358@deviant.kiev.zoral.com.ua> <20120602164847.GB2358@deviant.kiev.zoral.com.ua> <20120602171632.GC2358@deviant.kiev.zoral.com.ua> Date: Sat, 2 Jun 2012 18:27:58 +0100 X-Google-Sender-Auth: 9_oCPwwfNLdlvGTKGIF6BSa9YTA Message-ID: From: Attilio Rao To: Konstantin Belousov Content-Type: text/plain; charset=UTF-8 Cc: Alexander Kabaev , Alan Cox , Konstantin Belousov , Gianni , freebsd-arch@freebsd.org Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Jun 2012 17:28:00 -0000 2012/6/2 Konstantin Belousov : > On Sat, Jun 02, 2012 at 06:00:06PM +0100, Attilio Rao wrote: >> Sorry, resending with all the recipients in. >> >> Attilio >> >> >> ---------- Forwarded message ---------- >> From: Attilio Rao >> Date: 2012/6/2 >> Subject: Re: [RFC] Kernel shared variables >> To: Konstantin Belousov >> >> >> 2012/6/2 Konstantin Belousov : >> > On Sat, Jun 02, 2012 at 02:01:35PM +0100, Attilio Rao wrote: > [Tried to trim the text] > >> >> I think, he just wants to map in userland processes some pages from >> >> the static image of the kernel (packed together in a specific >> >> dataset). This imposes some non-trivial problem. The first thing is >> >> that the static image is not thought to have physical pages tied to >> >> it. The second is that he needs to make a clean design in order to let >> >> consumer of this mechanism to correctly locate informations they want >> >> within the shared page(s) and in the end read the correct values. >> > Right, exactly, and this is why I object to the "offsets" approach. >> > It basically moves us to the old times of the "jump tables" shared >> > libraries, that fortunately was never a case for FreeBSD even when >> > a.out was used. >> >> I'm objecting to this either. > My english is not good enough to understand this. Do you agree or disagree > with my statement that 'indexes' make it very hard to maintain ABI ? I agree with you. The offset approach just doesn't work clean on an ABI perspective. >> >> >> >> I have some reservations on both the implementation and the approach >> >> for retrieving datas from the page. >> >> In particular, I don't like that a new vm_object is allocated for this >> >> page. What I really would like would be: >> >> 1) very minimal implementation -- you just use >> >> pmap_enter()/pmap_remove() specifically when needed, separately, in >> >> fork(), execve(), etc. cases >> > Oh, this simply cannot work. >> >> And why? Assuming you provide a vm_page_t from an UMA zone just like >> fakepage do. Of course you cannot recycle for this purpose any page >> caming from vm_page_alloc(). > Due to pv_collect/pmap_pv_reclaim, the pte might be destroyed any time. > > Using hacks like mapping the page wired and then needing to hack > any VM space manipulation (fork/rfork/exec/exit/swapout/I possibly > missed several cases) just does not pay for it. Well my take was to map the page wired because of the nature of the workload too (static image -- present in memory -- wired page). [ trim ] >> The gettimeofday() implementation is a different story than what is asked here. > > But the goal is to have fast clocks, right ? What else is planned ? > > In fact, I think that if the whole goal is only fast clocks, then we > do not need any additional system mechanisms, since we can easily export > coefficients for rdtsc formula already. E.g. we can put it into elf auxv, > which is ugly but bearable. Not sure if there is anything else besides gettimeofday() that we want right now, in particular on global basis. I just mean to say that I don't think Giovanni put a lot of effort in correctness/robustness of gettimeofday userland implementation, so we should not judge that part of the patch too tightly. >> > Interesting question is how much shared the shared page needs be. >> > Obvious needs are shared between all same-ABI processes, but I can also >> > easily see a need for the per-process private information be present in >> > the 'private-shared' page. For silly but typical example, useful for >> > moronix-style benchmarks, see getpid(). >> >> Really the performance benefits of having fast getpid() is marginal if >> compared to heavilly used things like gettimeofday(). I cannot think >> of a per-process page implementing a fast syscall that can bring many >> perfomance advantages. > > This is completely true, but there may be other process-private data that > could benefit from the low access cost. I just do not know right now. I don't know either, thus I don't think there is a big urgence for per-process shared pages at all. Attilio -- Peace can only be achieved by understanding - A. Einstein From owner-freebsd-arch@FreeBSD.ORG Sat Jun 2 20:05:21 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 60D2C1065672; Sat, 2 Jun 2012 20:05:21 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail35.syd.optusnet.com.au (mail35.syd.optusnet.com.au [211.29.133.51]) by mx1.freebsd.org (Postfix) with ESMTP id E47598FC12; Sat, 2 Jun 2012 20:05:20 +0000 (UTC) Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail35.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q52K5Atd015942 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sun, 3 Jun 2012 06:05:11 +1000 Date: Sun, 3 Jun 2012 06:05:10 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov In-Reply-To: <20120602164847.GB2358@deviant.kiev.zoral.com.ua> Message-ID: <20120603053445.Y3302@besplex.bde.org> References: <20120601193522.GA2358@deviant.kiev.zoral.com.ua> <20120602164847.GB2358@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Attilio Rao , alc@FreeBSD.org, Giovanni Trematerra , Alexander Kabaev , freebsd-arch@FreeBSD.org Subject: Re: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Jun 2012 20:05:21 -0000 On Sat, 2 Jun 2012, Konstantin Belousov wrote: > On Sat, Jun 02, 2012 at 02:01:35PM +0100, Attilio Rao wrote: >> ... >> I have some reservations on both the implementation and the approach >> for retrieving datas from the page. >> In particular, I don't like that a new vm_object is allocated for this >> page. What I really would like would be: >> 1) very minimal implementation -- you just use >> pmap_enter()/pmap_remove() specifically when needed, separately, in >> fork(), execve(), etc. cases > Oh, this simply cannot work. > >> 2) more complete approach -- you make a very quick layer which let you >> map pages from the static image of the kernel and the shared page >> becomes just a specific consumer of this. This way the object has much >> more sense because it becomes an object associated to all the static >> image of the kernel > So you want to circumvent the vm layer. >> >> About the layering, I don't like that you require both a kernel and >> userland header to locate the objects within the page. This is very >> likely ABI breakage prone. It is needed a mechanism for retrieving at >> run time what Giovanni calls "indexes", or making it indexes-agnostic. > > And this is what VDSO is for. VDSO with the standard ELF symbol > interposition rules allow to have libc that is completely unaware of the > shared page and 'indexes', i.e. which works both for older kernel that > do not export required index, and for new kernels that export the same > information in some more advanced format. By having VDSO that exports I have no strong ideas about the ABI issues. Even shared libraries are too large and complicated for me :-). > e.g. gettimeofday() we would get override for libc gettimeofday, while > having fully functional libc for other, future and past, kernels, even > if the format of the data exported for super-fast gettimeofday changes. Please no getttimeofday() for the example :-). > As an additional note, already put by Bruce, the implementation of > usermode gettimeofday is exactly opposite of any reasonable implementation. > It looses the precision to the frequency of the event timer. Obvious > approach is to not have any periodically updating data for gettimeofday > purpose, and use some formula with rdtsc and kernel-provided coefficients > on the machines where rdtsc is usable. Actually, you can probably do gettimeofday() by exporting mounds of excecute-only and read-only kernel code and data in the in the shared page(s). The kernel code becomes just another way of implementing a shared library that is especially good for syscalls. It needs to run with only user privilege. x86 rdtsc normally has user privilege. User privilege for timecounter hardware in bus space would be problematic. Actually^2, you only need a small amount of kernel code for this -- just microtime() and what it calls, with only the timecounter hardware call being a problem. The kernel maintains lots of not-quite-constant timecounter state (primarily timehands offsets) that can be locked in the time domain in the same way that it is in the kernel. > Interesting question is how much shared the shared page needs be. > Obvious needs are shared between all same-ABI processes, but I can also > easily see a need for the per-process private information be present in > the 'private-shared' page. For silly but typical example, useful for > moronix-style benchmarks, see getpid(). Slightly better benchmarks use getppid() since the parent pid is not quite constant so it can't easily be cached in userland. But with a kernel read-only pages, it it doesn't even need time domain locking, since getppid() is inherently racy (the parent may go away) before it returns. Lots of read-only syscalls that don't require privilege or much locking could be implemented similarly. All syscalls can be put in the shared executable page(s), with most reducing to the same library code as now to actually enter the kernel. This is too large and complicated for me. Bruce From owner-freebsd-arch@FreeBSD.ORG Sat Jun 2 21:28:16 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id C6289106566B; Sat, 2 Jun 2012 21:28:16 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail04.syd.optusnet.com.au (mail04.syd.optusnet.com.au [211.29.132.185]) by mx1.freebsd.org (Postfix) with ESMTP id 30ED98FC1E; Sat, 2 Jun 2012 21:28:13 +0000 (UTC) Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail04.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q52LS958002626 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sun, 3 Jun 2012 07:28:10 +1000 Date: Sun, 3 Jun 2012 07:28:09 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov In-Reply-To: <20120602171632.GC2358@deviant.kiev.zoral.com.ua> Message-ID: <20120603063330.H3418@besplex.bde.org> References: <20120601193522.GA2358@deviant.kiev.zoral.com.ua> <20120602164847.GB2358@deviant.kiev.zoral.com.ua> <20120602171632.GC2358@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Gianni , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , freebsd-arch@FreeBSD.org Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Jun 2012 21:28:16 -0000 On Sat, 2 Jun 2012, Konstantin Belousov wrote: > On Sat, Jun 02, 2012 at 06:00:06PM +0100, Attilio Rao wrote: >> ... >> 2012/6/2 Konstantin Belousov : >>> On Sat, Jun 02, 2012 at 02:01:35PM +0100, Attilio Rao wrote: > [Tried to trim the text] [Trimmed more] >>> Right, exactly, and this is why I object to the "offsets" approach. >>> It basically moves us to the old times of the "jump tables" shared >>> libraries, that fortunately was never a case for FreeBSD even when >>> a.out was used. >> >> I'm objecting to this either. > My english is not good enough to understand this. Do you agree or disagree > with my statement that 'indexes' make it very hard to maintain ABI ? Syscall numbers are basically indexes, and work OK (because there aren't many of them even after ~30-35 years of accumulating them). > ... >> The gettimeofday() implementation is a different story than what is asked here. > > But the goal is to have fast clocks, right ? What else is planned ? > > In fact, I think that if the whole goal is only fast clocks, then we > do not need any additional system mechanisms, since we can easily export > coefficients for rdtsc formula already. E.g. we can put it into elf auxv, > which is ugly but bearable. How do you get the timehands offsets? These only need to be updated every second or so, or when used, but how can the application know when they need to be updated if this is not done automatically in the kernel by writing to a shared page? I can only think of the application arranging an alarm signal every second or so and updating then. No good for libraries. rdtsc is also very unportable, even on CPUs that have it. But all other x86 timecounter hardware is too slow if you want gettimeofday() to be fast and as accurate as it is now. Bruce