Date:      Fri, 1 Jun 2012 22:35:22 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Giovanni Trematerra <giovanni.trematerra@gmail.com>
Cc:        Attilio Rao <attilio@freebsd.org>, alc@freebsd.org, Alexander Kabaev <kan@freebsd.org>, freebsd-arch@freebsd.org
Subject:   Re: [RFC] Kernel shared variables
Message-ID:  <20120601193522.GA2358@deviant.kiev.zoral.com.ua>
In-Reply-To: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>
References:  <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>

On Fri, Jun 01, 2012 at 07:53:15PM +0200, Giovanni Trematerra wrote:
> Hello,
> I'd like to discuss a way to provide a mechanism to share some read-only
> data between kernel and user space programs, avoiding syscall overhead by
> implementing some syscalls, such as gettimeofday(3) and time(3), as ordinary
> user space routines.
>
> The patch at
> http://www.trematerra.net/patches/ksvar_experimental.patch
>
> is in a very experimental stage. It's just a proof-of-concept.
> It only works for an AMD64 kernel and only for 64-bit applications.
> The idea is to put all the variables that we want to share between kernel
> and user space into one or more consecutive pages of memory that will be
> mapped read-only into every running process. At the start of the first
> shared page there'll be a table with as many entries as the number of
> shared variables.
> Each entry is a 32-bit value that is the offset between the start of the
> shared page and the start of the variable in the page. The user space
> processes need to find out the mapped address of the shared page and use
> the table to access the shared variables.
> The kernel will export a variable to user space as an index, so user space
> code must refer to a specific index to access a kernel shared variable.
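
A minimal sketch of the lookup this layout implies, assuming the table
sits at the very start of the first shared page and holds byte offsets.
The names ksvar_lookup, ksvar_demo and DEMO_TABLE_SIZE are made up for
illustration and are not taken from the patch; the "page" here is an
ordinary static buffer, not a kernel mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_TABLE_SIZE 2	/* hypothetical: number of shared variables */

/*
 * Resolve shared variable `idx`: entry idx of the table at the start of the
 * page holds the byte offset from the page start to the variable.
 */
static const void *
ksvar_lookup(const void *page_base, size_t table_size, size_t idx)
{
	const uint32_t *table = page_base;

	if (page_base == NULL || idx >= table_size)
		return (NULL);
	return ((const char *)page_base + table[idx]);
}

/* Fake a shared page in ordinary memory and read variable 1 back. */
static int
ksvar_demo(void)
{
	static uint32_t page[16];	/* uint32_t storage keeps the table aligned */
	const int v0 = 7, v1 = 42;
	const void *p;
	int out;

	/* The first variable follows the table, the second follows the first. */
	page[0] = DEMO_TABLE_SIZE * sizeof(uint32_t);
	page[1] = page[0] + sizeof(int);
	memcpy((char *)page + page[0], &v0, sizeof(v0));
	memcpy((char *)page + page[1], &v1, sizeof(v1));

	p = ksvar_lookup(page, DEMO_TABLE_SIZE, 1);
	assert(p != NULL);
	memcpy(&out, p, sizeof(out));
	return (out);
}
```

Note that the consumer has to know both the index and the type of each
variable in advance; nothing in the page itself describes them.
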
> Let's take a quick look at the KPI/API for exporting/importing kernel
> shared variables.
> Say we want to implement a routine to export an int from the kernel.
> To define the variable to be exported inside the kernel you would use
>
> KSVAR_DEFINE(0, int, test_value);
>
> You have just defined an int variable named "test_value" at index 0.
> Inside the kernel you can write/read it as usual using the symbol test_value.
> Now you likely want to add to libc a function callable from user processes
> that returns the test_value variable. So first of all you need to import
> the variable.
>
> KSVAR_IMPORT(0, int, test_value);
>
> and to obtain a pointer to read the value you would use
>
> KSVAR(test_value);
>
> so your function would look something like this
>
> int get_test_value()
> {
>
>      return (*KSVAR(test_value));
> }
>
> Then inside your process just call the get_test_value() function as you
> usually do and you'll get a kernel-written value without switching to
> kernel mode.
>
> Let's see now in more detail how that could be accomplished.
> The shared variables will be accessed as normal variables and are read/write
> inside the kernel. The variables need to be inside the same page(s) and
> nothing but the shared variables (and the table) must be in the page(s). To
> obtain that I changed the linker script in this way
>
> --- a/sys/conf/ldscript.amd64
> +++ b/sys/conf/ldscript.amd64
> @@ -177,6 +177,15 @@ SECTIONS
>     *(.ldata .ldata.* .gnu.linkonce.l.*)
>     . = ALIGN(. != 0 ? 64 / 8 : 1);
>   }
> +  .ksvar ALIGN(CONSTANT (COMMONPAGESIZE)) :
> +  {
> +    __ksvar_set_start = .;
> +    *(.ksvar_table)
> +    *(.ksvar)
> +
> +   . = ALIGN(CONSTANT (COMMONPAGESIZE));
> +   __ksvar_set_stop = .;
> +  }
>   . = ALIGN(64 / 8);
>   _end = .; PROVIDE (end = .);
>   . = DATA_SEGMENT_END (.);
>
> When we want to define a variable in the kernel to share with user space
> we have to use the KSVAR_DEFINE macro in sys/sys/ksvar.h
>
> +struct ksvar_set {
> +       uint32_t idx;
> +       char *pksvar;
> +};
> +
> +/*
> + * Declare a variable into kernel shared linker_set.
> + */
> +#define        KSVAR_DEFINE(index, type, name) \
> +       static type name __section(".ksvar");                   \
> +       static struct ksvar_set name ## _ksvar_set = {            \
> +               .idx = index,                                     \
> +               .pksvar = (char *) &name                          \
> +       };                                                      \
> +       DATA_SET(ksvar_set, name ## _ksvar_set)
>
> Every variable must have a unique index. The indexes must
> start from zero and be consecutive. When you add an index
> you must bump the size of the table (KSVAR_TABLE_SIZE)
> (see sys/sys/ksvar.h)
>
> The variables are inside the kernel static image that isn't managed
> by the VM and so we need to allocate pages to map the physical addresses.
> A new SYSINIT (ksvarinit) will allocate a set of vm_page_t through
> the vm_phys_fictitious_reg_range interface and fill the table using
> the information of the ksvar_set linker set, then will create a
> vm_object_t (vm_object_ksvar), mark the fake pages as valid and put them
> into it.
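
The table-filling step of ksvarinit could look roughly like the following
sketch. It assumes entries shaped like struct ksvar_set above and that an
unused slot can be marked with UINT32_MAX; the names ksvar_entry,
ksvar_fill_table, ksvar_fill_demo and KSVAR_UNSET, and the error
convention, are assumptions and not the patch's code. It also shows why
the indexes must be unique and consecutive: a duplicate or a gap leaves
the table unusable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ksvar_entry {		/* mirrors struct ksvar_set from the mail */
	uint32_t idx;
	char *pksvar;
};

#define KSVAR_UNSET UINT32_MAX	/* hypothetical marker for an empty slot */

/*
 * Fill table[idx] with the offset of each variable from page_start.
 * Returns 0 on success, -1 on an out-of-range, duplicate or missing index.
 */
static int
ksvar_fill_table(uint32_t *table, size_t table_size, const char *page_start,
    const struct ksvar_entry *set, size_t nset)
{
	size_t i;

	for (i = 0; i < table_size; i++)
		table[i] = KSVAR_UNSET;
	for (i = 0; i < nset; i++) {
		if (set[i].idx >= table_size ||
		    table[set[i].idx] != KSVAR_UNSET)
			return (-1);
		table[set[i].idx] = (uint32_t)(set[i].pksvar - page_start);
	}
	for (i = 0; i < table_size; i++)
		if (table[i] == KSVAR_UNSET)
			return (-1);
	return (0);
}

/* Two variables in a fake page; dup != 0 gives both the same index. */
static int
ksvar_fill_demo(int dup)
{
	static char page[64];
	uint32_t table[2];
	struct ksvar_entry set[2] = {
		{ 1, page + 12 },
		{ dup ? 1u : 0u, page + 8 },
	};
	int error;

	error = ksvar_fill_table(table, 2, page, set, 2);
	if (error == 0)
		assert(table[0] == 8 && table[1] == 12);
	return (error);
}
```
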
> When a new process is created by exec(3) the vm_object_ksvar will be
> mapped read-only into the process address space by the vm_map_fixed routine
> just before mapping the user stack. The address of the mapping will be
> recorded inside the new p_ksvar field of struct proc.
> This field will be exported through a sysctl to the user space processes.
> In order to implement syscalls as user space routines, we have to find out
> the mapped address of the kernel shared variables when libc is mapped into
> the process. So I added a function marked with the constructor attribute.
> It will be called before any code in the user process and before any code
> inside libc.
>
> +__attribute((constructor)) void init_kernel_shared()
> +{
> +       int mib[2];
> +       size_t len;
> +       vm_offset_t ksvar_address;
> +
> +       mib[0] = CTL_KERN;
> +       mib[1] = KERN_KSVAR;
> +       len = sizeof(vm_offset_t);
> +       if (__sysctl(mib, 2, (void *) &ksvar_address, &len, NULL, 0) != -1)
> +               ksvar_table = (uint32_t *) ksvar_address;
> +}
>
> Once libc knows the address of the table it can access the shared
> variables.
>
> Just as a proof of concept I re-implemented gettimeofday(3) in user space.
> First of all I didn't remove the entry in syscall.master, just renamed the
> sys_gettimeofday. I need it for the fallback path.
> In the kernel I introduced a struct wall_clock.
>
> +struct wall_clock
> +{
> +       struct timeval  tv;
> +       struct timezone tz;
> +};
>
> The struct is exported through the sys/sys/time.h header.
> I defined a new kernel shared variable. To do so I added an index,
> WALL_CLOCK_INDEX, in sys/sys/ksvar.h and bumped KSVAR_TABLE_SIZE to 1.
> In sys/kern/kern_clocksource.c
>
> +/* kernel shared variable for implementing gettimeofday. */
> +KSVAR_DEFINE(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);
>
> Now we have defined a shared variable at index WALL_CLOCK_INDEX of type
> struct wall_clock and named wall_clock.
> Inside handleevents I update the info exported by wall_clock.
>
> +       struct timeval tv;
> +
> +       /* update time for userspace gettimeofday */
> +       microtime(&tv);
> +       wall_clock.tv = tv;
> +       wall_clock.tz.tz_minuteswest = tz_minuteswest;
> +       wall_clock.tz.tz_dsttime = tz_dsttime;
>
> Now, in libc we import the shared variable
>
> +KSVAR_IMPORT(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);
>
> note that WALL_CLOCK_INDEX must be the same as the one defined
> inside the kernel, and define a new function gettimeofday
>
> +int
> +gettimeofday(struct timeval *tp, struct timezone *tzp)
> +{
> +
> +       /* fallback to syscall if kernel doesn't export ksvar */
> +       if (!KSVAR_IS_ACTIVE())
> +               return (sys_gettimeofday(tp, tzp));
> +
> +       if (tp != NULL)
> +               *tp = KSVAR(wall_clock)->tv;
> +       if (tzp != NULL)
> +               *tzp = KSVAR(wall_clock)->tz;
> +       return (0);
> +}
>
> Now when a process calls gettimeofday, it will actually call that function.
> If the process makes a lot of calls to gettimeofday, we will see a
> performance boost.
> Note that if ksvars are not exported from the kernel (KSVAR_IS_ACTIVE),
> the function falls back to calling the actual syscall (sys_gettimeofday).
>
> Open tasks
> - implement support for 32-bit emulated processes running in a 64-bit
> environment.
> - extend support to other architectures
> - implement more syscalls
> - benchmarks
> - Test, test, test.
>
> I'm looking forward to hearing your comments and suggestions.

I very much dislike what you described; it makes ABI maintenance
a nightmare.
Below is a mail I wrote around spring 2009, with some notes about the
proposal I would prefer. This is what is called a vdso in Linux land.


On Tue, Mar 31, 2009 at 04:04:46PM +0200, Giuseppe Cocomazzi wrote:
> Gentle kib,
> I've understood what you mean to do: you told me to implement the
> syscall trampoline as a dynamic shared object to be copied by the kernel
> in every process shared page; then we would eventually pass the shared
> page address to the rtld using an AT_SYSINFO_EHDR. During program
> startup, if this is found by the dyn linker, we define a symbol
> containing that previously obtained address, which libc could easily
> access for its own syscall wrappers. Ok? Is this your idea? Or didn't I
> get it at all?
> Now, if I got what you meant, let me explain my already done work on the
> syscall trampoline.
> My approach does not make use of any dso: the kernel just copies a
> little piece of code in the syscall trampoline shared page:
> 	popl	%ecx
> 	int	$0x80
> 	pushl	%ecx
> without any symbol. This would be changed to its sysenter counterpart by
> cpu_startup, in case the SEP bit is set. A sysctl has been created in
> order to let user programs obtain that sc trampoline address.
> Crt has been patched to retrieve the address by means of the sysctl at
> run time and then put this address in a global symbol named 'sctramp'.
> (I want you to know that the sysctl mechanism could simply be avoided if
> we decide to have syscall shared pages at a fixed address: actually
> they are mapped to maxsaddr. This needs to be discussed later, but is not
> the hot point now.)
> The 'sctramp' symbol is accessed by libc wrappers to enter the kernel
> when issuing system calls:
> 	#define KERNCALL	call *%sctramp
> What I want to say is the following: I think your approach is the same
> as mine, in that the rtld has to load a shared object in any case, be
> it crt or a custom dso. But since crt is already there, why do we need
> to create another .so when we already have one which is linked
> in the process address space anyway? Think of crt as your custom dso, and
> you'll get the picture.
> Maybe your approach is more elegant than mine, though mine is more
> minimal and less invasive (no symbols to take into account but one,
> etc). Furthermore, don't underestimate the fact that I've already coded
> and tested it: I attach a copy of the whole patch, so that you can have
> a look at it, accompanied by a short explanatory paper. Don't waste your
> time reviewing the kernel part (this is a rookie's task), concentrate
> on the user space part.
> Hope I don't cause a waste of your precious time,
> Regards

Crt is not a dso. It is the stub that gets linked statically into most
binaries. The actual mechanism you implemented is _orthogonal_ to the
decision of having the shared page as a dso.

That dso shall not be used to provide any "addresses" to libc. Libc
syscall stubs shall call some function symbol that is defined strong
in the dso and weak in the rtld. The rtld implementation shall be int 0x80.
Rtld shall preload the dso, assuming the aux entry supplied by the
kernel contains the phdr address of the object.
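
As a rough illustration of that symbol arrangement, the sketch below shows
only the weak half, in a single translation unit: a weak default that
stands in for the int 0x80 path, called by name from a libc-style stub.
All names are hypothetical, __attribute__((weak)) is a GCC/Clang
extension, and the sketch does not model how rtld would preload the dso
and bind the call to its strong definition.

```c
#include <assert.h>

enum demo_path { DEMO_PATH_INT80 = 1, DEMO_PATH_DSO = 2 };

/*
 * Weak default, standing in for the rtld-provided int 0x80 implementation.
 * A strong demo_syscall_entry() in another object (the dso, in the scheme
 * above) would be expected to return DEMO_PATH_DSO instead.
 */
__attribute__((weak)) enum demo_path
demo_syscall_entry(void)
{
	return (DEMO_PATH_INT80);
}

/* libc-style stub: calls the symbol by name and never handles an address. */
static enum demo_path
demo_syscall_stub(void)
{
	enum demo_path path = demo_syscall_entry();

	assert(path == DEMO_PATH_INT80 || path == DEMO_PATH_DSO);
	return (path);
}
```

With no other definition linked in, the stub ends up on the fallback path;
the stub itself would not need to change once a strong definition exists.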

Features that the dso gives us:
1. Absolute freedom in the layout of the page.
2. The page may implement several entries, among them:
	- syscall (that is what you described above);
	- gettimeofday with an optimized implementation (see the long threads
		about TSC, ACPI, HPET etc);
	- getpid;
	- machine-optimized copy routines like memcpy, strcpy and so on;
	- signal trampoline code (see #5 below for why this is _very_
		desirable);
	- ... (here my imagination stopped).
3. Addition of new symbols does not require any changes to libc to activate
   them, because the standard behaviour of the dynamic linker gives the
   priority to the symbols from the preloaded objects over the symbols
   from the dependencies.
4. The dso gives the right place for the CFI to be found by debuggers and
   exception propagation code (CFI stands for Call Frame Information;
   it is used to allow stack unwinding to properly restore frames
   and registers). amd64 already suffers from the lack of CFI on signal
   trampolines and sysenter wrappers. A bare shared page is ugly from
   this point of view. The need for CFI was one of the main motivations
   for the dso on Linux.
5. Putting the signal trampoline into the shared page instead of on top of
   the stack would be a great step towards enabling the NX bit for the stack.
6. The Linuxolator would get the vdso too; the lack of it is a big
   deficiency now.

As you see, the list of items that are desirable on the shared page
is quite long, and having a fixed format is a large problem for
binary compatibility.
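
To make the compatibility argument concrete, here is a toy comparison,
not real ELF handling: a consumer that hard-codes a slot number breaks as
soon as the page layout is rearranged, while a consumer that looks an
entry up by name does not. The struct, the table contents and all demo_*
names are invented for the example; in the dso scheme the dynamic
linker's symbol lookup plays the role of demo_lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*demo_fn)(void);

struct demo_export {
	const char *name;
	demo_fn fn;
};

static int demo_getpid(void) { return (100); }
static int demo_gettime(void) { return (200); }

/* Two "releases" of the page exporting the same entries in different order. */
static const struct demo_export demo_v1[] = {
	{ "getpid", demo_getpid }, { "gettime", demo_gettime },
};
static const struct demo_export demo_v2[] = {
	{ "gettime", demo_gettime }, { "getpid", demo_getpid },
};

/* Name-based lookup: independent of where the entry sits in the table. */
static demo_fn
demo_lookup(const struct demo_export *tab, size_t n, const char *name)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(tab[i].name, name) == 0)
			return (tab[i].fn);
	return (NULL);
}

/* Call "getpid" by name (by_name != 0) or through hard-coded slot 0. */
static int
demo_call_getpid(const struct demo_export *tab, size_t n, int by_name)
{
	demo_fn fn = by_name ? demo_lookup(tab, n, "getpid") : tab[0].fn;

	assert(fn != NULL);
	return (fn());
}
```

Against demo_v2 the slot-based call silently reaches the wrong routine,
which is the kind of breakage a fixed page format invites.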

