Date:      Sat, 2 Jun 2012 19:48:47 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Attilio Rao <attilio@freebsd.org>
Cc:        alc@freebsd.org, Alexander Kabaev <kan@freebsd.org>, Giovanni Trematerra <giovanni.trematerra@gmail.com>, freebsd-arch@freebsd.org
Subject:   Re: [RFC] Kernel shared variables
Message-ID:  <20120602164847.GB2358@deviant.kiev.zoral.com.ua>
In-Reply-To: <CAJ-FndC71=3Jo+BxQi==gCoLipBxj8X8XMBydjvrcKeGw+WOnA@mail.gmail.com>
References:  <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com> <20120601193522.GA2358@deviant.kiev.zoral.com.ua> <CAJ-FndC71=3Jo+BxQi==gCoLipBxj8X8XMBydjvrcKeGw+WOnA@mail.gmail.com>

On Sat, Jun 02, 2012 at 02:01:35PM +0100, Attilio Rao wrote:
> 2012/6/1 Konstantin Belousov <kostikbel@gmail.com>:
> > On Fri, Jun 01, 2012 at 07:53:15PM +0200, Giovanni Trematerra wrote:
> >> Hello,
> >> I'd like to discuss a mechanism for sharing some read-only data between
> >> the kernel and user space programs, avoiding syscall overhead by
> >> implementing some syscalls, such as gettimeofday(3) and time(3), as
> >> ordinary user space routines.
> >>
> >> The patch at
> >> http://www.trematerra.net/patches/ksvar_experimental.patch
> >>
> >> is in a very experimental stage. It's just a proof-of-concept.
> >> Only works for an AMD64 kernel and only for 64-bit applications.
> >> The idea is to place all the variables that we want to share between
> >> kernel and user space in one or more consecutive pages of memory that
> >> will be mapped read-only into every running process. At the start of
> >> the first shared page there'll be a table with as many entries as there
> >> are shared variables. Each entry is a 32-bit value that is the offset
> >> between the start of the shared page and the start of the variable in
> >> the page. User space processes need to find out the mapped address of
> >> the shared page and use the table to access the shared variables.
> >> The kernel will export a variable to user space as an index, so user
> >> space code must refer to a specific index to access a kernel shared
> >> variable.
> >> Let's take a quick look at the KPI/API for exporting/importing kernel
> >> shared variables.
> >> Say we want to implement a routine to export an int from the kernel.
> >> To define the variable to be exported inside the kernel you would use
> >>
> >> KSVAR_DEFINE(0, int, test_value);
> >>
> >> You have just defined an int variable named "test_value" at index 0.
> >> Inside the kernel you can read/write it as usual using the symbol
> >> test_value. Now you likely want to add to libc a function, callable
> >> from user processes, that returns the test_value variable. So first of
> >> all you need to import the variable.
> >>
> >> KSVAR_IMPORT(0, int, test_value);
> >>
> >> and to obtain a pointer to read the value you would use
> >>
> >> KSVAR(test_value);
> >>
> >> so your function would look something like this
> >>
> >> int get_test_value()
> >> {
> >>
> >>       return (*KSVAR(test_value));
> >> }
> >>
> >> Then inside your process just call the get_test_value() function as you
> >> usually do and you'll get a kernel-written value without switching into
> >> kernel mode.
> >>
> >> Let's now see in more detail how that could be accomplished.
> >> The shared variables will be accessed as normal variables and are
> >> read/write inside the kernel. The variables need to be inside the same
> >> page(s), and nothing but the shared variables (and the table) may be in
> >> the page(s). To obtain that I changed the linker script in this way
> >>
> >> --- a/sys/conf/ldscript.amd64
> >> +++ b/sys/conf/ldscript.amd64
> >> @@ -177,6 +177,15 @@ SECTIONS
> >>     *(.ldata .ldata.* .gnu.linkonce.l.*)
> >>     . = ALIGN(. != 0 ? 64 / 8 : 1);
> >>   }
> >> +  .ksvar ALIGN(CONSTANT (COMMONPAGESIZE)) :
> >> +  {
> >> +    __ksvar_set_start = .;
> >> +    *(.ksvar_table)
> >> +    *(.ksvar)
> >> +
> >> +   . = ALIGN(CONSTANT (COMMONPAGESIZE));
> >> +   __ksvar_set_stop = .;
> >> +  }
> >>   . = ALIGN(64 / 8);
> >>   _end = .; PROVIDE (end = .);
> >>   . = DATA_SEGMENT_END (.);
> >>
> >> When we want to define a variable in the kernel to share with user space
> >> we have to use the KSVAR_DEFINE macro in sys/sys/ksvar.h
> >>
> >> +struct ksvar_set {
> >> +       uint32_t idx;
> >> +       char *pksvar;
> >> +};
> >> +
> >> +/*
> >> + * Declare a variable into kernel shared linker_set.
> >> + */
> >> +#define        KSVAR_DEFINE(index, type, name)                 \
> >> +       static type name __section(".ksvar");                   \
> >> +       static struct ksvar_set name ## _ksvar_set = {          \
> >> +               .idx = index,                                   \
> >> +               .pksvar = (char *) &name                        \
> >> +       };                                                      \
> >> +       DATA_SET(ksvar_set, name ## _ksvar_set)
> >>
> >> Every variable must have a unique index. The indexes must
> >> start from zero and be consecutive. When you add an index
> >> you must bump the size of the table (KSVAR_TABLE_SIZE)
> >> (see sys/sys/ksvar.h)
> >>
> >> The variables are inside the kernel static image, which isn't managed
> >> by the VM, so we need to allocate pages to map the physical addresses.
> >> A new SYSINIT (ksvarinit) will allocate a set of vm_page_t through
> >> the vm_phys_fictitious_reg_range interface and fill the table using
> >> the information in the ksvar_set linker set, then will create a
> >> vm_object_t (vm_object_ksvar), mark the fake pages as valid and put
> >> them into it.
> >> When a new process is created by exec(3), the vm_object_ksvar will be
> >> mapped read-only into the process address space by the vm_map_fixed
> >> routine just before mapping the user stack. The address of the mapping
> >> will be recorded in the new p_ksvar field of struct proc.
> >> This field will be exported through a sysctl to user space processes.
> >> In order to implement syscalls as user space routines, we have to find
> >> out the mapped address of the kernel shared variables when libc is
> >> mapped into the process. So I added a function marked with the
> >> constructor attribute. It will be called before any code in the user
> >> process and before any code inside libc.
> >>
> >> +__attribute((constructor)) void init_kernel_shared()
> >> +{
> >> +       int mib[2];
> >> +       size_t len;
> >> +       vm_offset_t ksvar_address;
> >> +
> >> +       mib[0] = CTL_KERN;
> >> +       mib[1] = KERN_KSVAR;
> >> +       len = sizeof(vm_offset_t);
> >> +       if (__sysctl(mib, 2, (void *) &ksvar_address, &len, NULL, 0) != -1)
> >> +               ksvar_table = (uint32_t *) ksvar_address;
> >> +}
> >>
> >> Once libc knows the address of the table it can access the shared
> >> variables.
> >>
> >> Just as a proof of concept I re-implemented gettimeofday(3) in user
> >> space. First of all, I didn't remove the entry in syscalls.master, I
> >> just renamed it to sys_gettimeofday. I need it for the fallback path.
> >> In the kernel I introduced a struct wall_clock.
> >>
> >> +struct wall_clock
> >> +{
> >> +       struct timeval  tv;
> >> +       struct timezone tz;
> >> +};
> >>
> >> The struct is exported through sys/sys/time.h header.
> >> I defined a new kernel shared variable. To do so I added an index in
> >> sys/sys/ksvar.h
> >> WALL_CLOCK_INDEX and bumped KSVAR_TABLE_SIZE to 1.
> >> In the sys/kern/kern_clocksource.c
> >>
> >> +/* kernel shared variable for implementing gettimeofday. */
> >> +KSVAR_DEFINE(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);
> >>
> >> Now we defined a shared variable at index WALL_CLOCK_INDEX of type
> >> struct wall_clock and named wall_clock.
> >> Inside handleevents I update the info exported by wall_clock.
> >>
> >> +       struct timeval tv;
> >> +
> >> +       /* update time for userspace gettimeofday */
> >> +       microtime(&tv);
> >> +       wall_clock.tv = tv;
> >> +       wall_clock.tz.tz_minuteswest = tz_minuteswest;
> >> +       wall_clock.tz.tz_dsttime = tz_dsttime;
> >>
> >> Now, in libc we import the shared variable
> >>
> >> +KSVAR_IMPORT(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);
> >>
> >> note that WALL_CLOCK_INDEX must be the same as the one defined
> >> inside the kernel, and define a new function gettimeofday
> >>
> >> +int
> >> +gettimeofday(struct timeval *tp, struct timezone *tzp)
> >> +{
> >> +
> >> +       /* fallback to syscall if kernel doesn't export ksvar */
> >> +       if (!KSVAR_IS_ACTIVE())
> >> +               return (sys_gettimeofday(tp, tzp));
> >> +
> >> +       if (tp != NULL)
> >> +               *tp = KSVAR(wall_clock)->tv;
> >> +       if (tzp != NULL)
> >> +               *tzp = KSVAR(wall_clock)->tz;
> >> +       return (0);
> >> +}
> >>
> >> Now when a process calls gettimeofday, it will actually call that
> >> function. If the process makes a lot of calls to gettimeofday, we will
> >> see a performance boost.
> >> Note that if ksvars are not exported by the kernel (KSVAR_IS_ACTIVE),
> >> the function falls back to calling the actual syscall
> >> (sys_gettimeofday).
> >>
> >> Open tasks
> >> - implement support for 32-bit emulated processes running in a 64-bit
> >> environment.
> >> - extend support to other architectures
> >> - implement more syscalls
> >> - benchmarks
> >> - Test, test, test.
> >>
> >> I'm looking forward to hearing your comments and suggestions.
> >
> > I very much dislike what you described; it makes ABI maintenance
> > a nightmare.
> > Below is some mail I wrote around Spring 2009, making some notes about
> > the desired proposal. This is what is called vdso in Linux land.
>
> Did you bother to read at least Giovanni's description?
> Because this has nothing to do with VDSO in Linux.
Did you bother to think briefly about why I object?

>
> I think he just wants to map into userland processes some pages from
> the static image of the kernel (packed together in a specific
> dataset). This poses some non-trivial problems. The first is
> that the static image is not designed to have physical pages tied to
> it. The second is that he needs a clean design that lets
> consumers of this mechanism correctly locate the information they want
> within the shared page(s) and, in the end, read the correct values.
Right, exactly, and this is why I object to the "offsets" approach.
It basically takes us back to the old days of "jump table" shared
libraries, which fortunately was never the case for FreeBSD, even when
a.out was used.
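
A minimal sketch of the offset-table lookup described in the proposal,
to make the coupling concrete. It is a model under stated assumptions,
not the patch: ksvar_lookup(), ksvar_fake_page() and ksvar_read_int()
are hypothetical helpers, and the 64-byte offset is arbitrary. Only
KSVAR_TABLE_SIZE and WALL_CLOCK_INDEX are names from the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Names from the proposal; the values here are illustrative. */
#define KSVAR_TABLE_SIZE 1	/* number of exported variables */
#define WALL_CLOCK_INDEX 0	/* index agreed between kernel and libc */

/*
 * Hypothetical helper: return a pointer to the variable at 'index'.
 * The page starts with a table of 32-bit offsets, one per variable.
 */
static const void *
ksvar_lookup(const unsigned char *page, uint32_t index)
{
	uint32_t off;

	assert(index < KSVAR_TABLE_SIZE);
	/* memcpy avoids alignment and aliasing assumptions about 'page'. */
	memcpy(&off, page + index * sizeof(off), sizeof(off));
	return (page + off);
}

/* Hypothetical helper: build a fake page holding one int at index 0. */
static void
ksvar_fake_page(unsigned char *page, size_t size, int value)
{
	uint32_t off = 64;	/* arbitrary offset past the table */

	assert(size >= off + sizeof(value));
	memset(page, 0, size);
	memcpy(page, &off, sizeof(off));
	memcpy(page + off, &value, sizeof(value));
}

/* Hypothetical helper: read an int-typed variable through the table. */
static int
ksvar_read_int(const unsigned char *page, uint32_t index)
{
	int v;

	memcpy(&v, ksvar_lookup(page, index), sizeof(v));
	return (v);
}
```

Both the index and the type of the variable are compiled into the
consumer, so a kernel that renumbers or resizes an entry silently
breaks every binary built against the old header.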

>
> I have some reservations about both the implementation and the approach
> for retrieving data from the page.
> In particular, I don't like that a new vm_object is allocated for this
> page. What I really would like would be:
> 1) very minimal implementation -- you just use
> pmap_enter()/pmap_remove() specifically when needed, separately, in
> fork(), execve(), etc. cases
Oh, this simply cannot work.

> 2) more complete approach -- you make a very quick layer which lets you
> map pages from the static image of the kernel and the shared page
> becomes just a specific consumer of this. This way the object makes much
> more sense because it becomes an object associated with the whole static
> image of the kernel
So you want to circumvent the vm layer.

>
> About the layering, I don't like that you require both a kernel and
> userland header to locate the objects within the page. This is very
> likely ABI breakage prone. A mechanism is needed for retrieving at
> run time what Giovanni calls "indexes", or for making it index-agnostic.

And this is what VDSO is for. A VDSO, with the standard ELF symbol
interposition rules, allows a libc that is completely unaware of the
shared page and 'indexes', i.e. one which works both with older kernels
that do not export the required index, and with new kernels that export
the same information in some more advanced format. By having a VDSO that
exports e.g. gettimeofday() we would get an override for the libc
gettimeofday, while keeping a fully functional libc for other kernels,
future and past, even if the format of the data exported for super-fast
gettimeofday changes.
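
A rough model of that override behaviour, assuming an explicit function
pointer stands in for what ELF symbol resolution would do at load time.
Nothing here is an existing interface: install_fast_path(),
libc_gettimeofday_model() and stub_fast_gettimeofday() are hypothetical
names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

typedef int (*gettimeofday_fn)(struct timeval *, struct timezone *);

/* NULL models an older kernel that provides no fast implementation. */
static gettimeofday_fn fast_gettimeofday = NULL;

/* Hypothetical: models the kernel-supplied object overriding the symbol. */
static void
install_fast_path(gettimeofday_fn fn)
{
	assert(fn != NULL);
	fast_gettimeofday = fn;
}

/* The caller never learns how, or in what format, the time is obtained. */
static int
libc_gettimeofday_model(struct timeval *tp, struct timezone *tzp)
{
	if (fast_gettimeofday != NULL)
		return (fast_gettimeofday(tp, tzp));
	return (gettimeofday(tp, tzp));	/* ordinary syscall-backed path */
}

/* Stand-in for a kernel-provided fast path; returns a fixed time. */
static int
stub_fast_gettimeofday(struct timeval *tp, struct timezone *tzp)
{
	(void)tzp;
	if (tp != NULL) {
		tp->tv_sec = 42;
		tp->tv_usec = 0;
	}
	return (0);
}
```

With a real VDSO the dynamic linker would make this choice, so libc
would carry neither the pointer nor any knowledge of the data layout.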

The tight coupling between the VDSO and the kernel is not a problem,
since the VDSO is part of the kernel from the deployment POV. Moreover,
either the existing ELF linker in the kernel, or some trivial
modifications to it, would allow us to avoid 'indexes' on the kernel
side too.

We already have a shared page between the kernel and the whole set of
same-ABI processes. Currently it is used for signal trampolines only.
The hard part of the task is to provide the VDSO build glue. Also, IMO,
the hard task is to define a sensible gettimeofday() implementation,
probably using rdtsc in usermode. The shared page is easy, or at least
it is already there without ugly and non-working vm hacks.

As an additional note, already made by Bruce, the proposed implementation
of usermode gettimeofday is the exact opposite of any reasonable
implementation. It limits the precision to the frequency of the event
timer. The obvious approach is to not have any periodically updated data
for gettimeofday purposes, and to use some formula with rdtsc and
kernel-provided coefficients on the machines where rdtsc is usable.
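
One possible shape for such a formula, as a sketch only. The struct
tsc_timeinfo layout, its field names and tsc_to_nsec() are assumptions
made for illustration; the fixed-point multiply-and-shift scaling is one
common way to convert ticks to nanoseconds, not something specified in
this thread. The TSC reading is passed in by the caller so that the
example does not depend on rdtsc being available.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical coefficients the kernel would publish in the shared page. */
struct tsc_timeinfo {
	uint64_t base_tsc;	/* TSC value when base_nsec was sampled */
	uint64_t base_nsec;	/* time at base_tsc, in nanoseconds */
	uint32_t mult;		/* nanoseconds per tick, scaled by 2^shift */
	uint32_t shift;		/* fixed-point scale of 'mult' */
};

/*
 * Convert a TSC reading to nanoseconds:
 *     nsec = base_nsec + ((tsc - base_tsc) * mult) >> shift
 * The product must fit in 64 bits, so the kernel would have to refresh
 * the base values often enough to keep the delta small.
 */
static uint64_t
tsc_to_nsec(const struct tsc_timeinfo *ti, uint64_t tsc)
{
	uint64_t delta;

	assert(tsc >= ti->base_tsc);
	assert(ti->shift < 64);
	delta = tsc - ti->base_tsc;
	return (ti->base_nsec + ((delta * ti->mult) >> ti->shift));
}
```

For a 2 GHz counter one tick is 0.5 ns, which could be encoded as
mult = 512 with shift = 10. The resolution then follows the counter
rather than the event timer.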

An interesting question is how shared the shared page needs to be.
The obvious need is sharing between all same-ABI processes, but I can
also easily see a need for per-process private information to be present
in a 'private-shared' page. For a silly but typical example, useful for
moronix-style benchmarks, see getpid().

Shrug.
