Date:      Sat, 2 Jun 2012 14:01:35 +0100
From:      Attilio Rao <attilio@freebsd.org>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        alc@freebsd.org, Alexander Kabaev <kan@freebsd.org>, Giovanni Trematerra <giovanni.trematerra@gmail.com>, freebsd-arch@freebsd.org
Subject:   Re: [RFC] Kernel shared variables
Message-ID:  <CAJ-FndC71=3Jo%2BBxQi==gCoLipBxj8X8XMBydjvrcKeGw%2BWOnA@mail.gmail.com>
In-Reply-To: <20120601193522.GA2358@deviant.kiev.zoral.com.ua>
References:  <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2%2BoYo%2BwwT4ipA@mail.gmail.com> <20120601193522.GA2358@deviant.kiev.zoral.com.ua>

2012/6/1 Konstantin Belousov <kostikbel@gmail.com>:
> On Fri, Jun 01, 2012 at 07:53:15PM +0200, Giovanni Trematerra wrote:
>> Hello,
>> I'd like to discuss a way to provide a mechanism to share some read-only
>> data between kernel and user space programs avoiding syscall overhead,
>> implementing some of them, such as gettimeofday(3) and time(3), as ordinary
>> user space routines.
>>
>> The patch at
>> http://www.trematerra.net/patches/ksvar_experimental.patch
>>
>> is in a very experimental stage. It's just a proof-of-concept.
>> Only works for an AMD64 kernel and only for 64-bit applications.
>> The idea is to have all the variables that we want to share between kernel
>> and user space in one or more consecutive pages of memory that will be
>> mapped read-only into every running process. At the start of the first
>> shared page there'll be a table with as many entries as the number of
>> shared variables. Each entry is a 32-bit value that is the offset between
>> the start of the shared page and the start of the variable in the page.
>> The user space processes need to find out the map address of the shared
>> page and use the table to access the shared variables.
>> Kernel will export a variable to user space as an index, so user space code
>> must refer to a specific index to access a kernel shared variable.
>> Let's take a quick look at the KPI/API for exporting/importing kernel
>> shared variables.
>> Say we want to implement a routine to export an int from the kernel.
>> To define the variable to be exported inside the kernel you would use
>>
>> KSVAR_DEFINE(0, int, test_value);
>>
>> You have just defined an int variable named "test_value" at index 0.
>> Inside the kernel you can write/read as usual using the symbol test_value.
>> Now you likely want to add to libc a function callable from user processes
>> that returns the test_value variable. So first of all you need to import
>> the variable.
>>
>> KSVAR_IMPORT(0, int, test_value);
>>
>> and to obtain a pointer to read the value you would use
>>
>> KSVAR(test_value);
>>
>> so your function would look something like this
>>
>> int get_test_value()
>> {
>>
>>      return (*KSVAR(test_value));
>> }
>>
>> Then inside your process just call the get_test_value() function as you
>> usually do and you'll get a kernel written value without switching to
>> kernel mode.
>>
>> Let's see now in more detail how that could be accomplished.
>> The shared variables will be accessed as normal variables and are read/write
>> inside the kernel. The variables need to be inside the same page(s) and
>> nothing but the shared variables (and the table) must be in the page(s).
>> To obtain that I changed the linker script in this way
>>
>> --- a/sys/conf/ldscript.amd64
>> +++ b/sys/conf/ldscript.amd64
>> @@ -177,6 +177,15 @@ SECTIONS
>>     *(.ldata .ldata.* .gnu.linkonce.l.*)
>>     . = ALIGN(. != 0 ? 64 / 8 : 1);
>>   }
>> +  .ksvar ALIGN(CONSTANT (COMMONPAGESIZE)) :
>> +  {
>> +    __ksvar_set_start = .;
>> +    *(.ksvar_table)
>> +    *(.ksvar)
>> +
>> +   . = ALIGN(CONSTANT (COMMONPAGESIZE));
>> +   __ksvar_set_stop = .;
>> +  }
>>   . = ALIGN(64 / 8);
>>   _end = .; PROVIDE (end = .);
>>   . = DATA_SEGMENT_END (.);
>>
>> When we want to define a variable in the kernel to share with user space
>> we have to use KSVAR_DEFINE macro in sys/sys/ksvar.h
>>
>> +struct ksvar_set {
>> +       uint32_t idx;
>> +       char *pksvar;
>> +};
>> +
>> +/*
>> + * Declare a variable into kernel shared linker_set.
>> + */
>> +#define        KSVAR_DEFINE(index, type, name) \
>> +       static type name __section(".ksvar");                   \
>> +       static struct ksvar_set name ## _ksvar_set = {          \
>> +               .idx = index,                                   \
>> +               .pksvar = (char *) &name                        \
>> +       };                                                      \
>> +       DATA_SET(ksvar_set, name ## _ksvar_set)
>>
>> Every variable must have a unique index. The indexes must
>> start from zero and be consecutive. When you add an index
>> you must bump the size of the table (KSVAR_TABLE_SIZE)
>> (see sys/sys/ksvar.h)
>>
>> The variables are inside the kernel static image that isn't managed
>> by the VM and so we need to allocate pages to map the physical addresses.
>> A new SYSINIT (ksvarinit) will allocate a set of vm_page_t through
>> the vm_phys_fictitious_reg_range interface and fill the table using
>> the information of the ksvar_set linker set, then will create a
>> vm_object_t (vm_object_ksvar), mark the fake pages as valid and put them
>> into it.
>> When a new process is created by exec(3) the vm_object_ksvar will be
>> mapped read-only into the process address space by the vm_map_fixed routine
>> just before mapping the user stack. The address of the mapping will be
>> recorded inside the new p_ksvar field of the struct proc.
>> This field will be exported through a sysctl to the user space processes.
>> In order to implement syscalls as user space routines, we have to find out
>> the mapped address of the kernel shared variables when the libc is mapped
>> into the process. So I added a function marked with the attribute
>> constructor. It will be called before any code in the user process and
>> before any code inside the libc.
>>
>> +__attribute((constructor)) void init_kernel_shared()
>> +{
>> +       int mib[2];
>> +       size_t len;
>> +       vm_offset_t ksvar_address;
>> +
>> +       mib[0] = CTL_KERN;
>> +       mib[1] = KERN_KSVAR;
>> +       len = sizeof(vm_offset_t);
>> +       if (__sysctl(mib, 2, (void *) &ksvar_address, &len, NULL, 0) != -1)
>> +               ksvar_table = (uint32_t *) ksvar_address;
>> +}
>>
>> Once the libc knows the address of the table it can access the shared
>> variables.
>>
>> Just as proof of concept I re-implemented gettimeofday(3) in user space.
>> First of all I didn't remove the entry in syscall.master, I just renamed
>> it to sys_gettimeofday. I need it for the fallback path.
>> In the kernel I introduced a struct wall_clock.
>>
>> +struct wall_clock
>> +{
>> +       struct timeval  tv;
>> +       struct timezone tz;
>> +};
>>
>> The struct is exported through sys/sys/time.h header.
>> I defined a new kernel shared variable. To do so I added an index in
>> sys/sys/ksvar.h
>> WALL_CLOCK_INDEX and bumped KSVAR_TABLE_SIZE to 1.
>> In the sys/kern/kern_clocksource.c
>>
>> +/* kernel shared variable for implementing gettimeofday. */
>> +KSVAR_DEFINE(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);
>>
>> Now we defined a shared variable at index WALL_CLOCK_INDEX of type
>> struct wall_clock and named wall_clock.
>> Inside handleevents I update the info exported by wall_clock.
>>
>> +       struct timeval tv;
>> +
>> +       /* update time for userspace gettimeofday */
>> +       microtime(&tv);
>> +       wall_clock.tv = tv;
>> +       wall_clock.tz.tz_minuteswest = tz_minuteswest;
>> +       wall_clock.tz.tz_dsttime = tz_dsttime;
>>
>> Now, in libc we import the shared variable
>>
>> +KSVAR_IMPORT(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);
>>
>> note that WALL_CLOCK_INDEX must be the same as the one defined
>> inside the kernel, and define a new function gettimeofday
>>
>> +int
>> +gettimeofday(struct timeval *tp, struct timezone *tzp)
>> +{
>> +
>> +       /* fallback to syscall if kernel doesn't export ksvar */
>> +       if (!KSVAR_IS_ACTIVE())
>> +               return (sys_gettimeofday(tp, tzp));
>> +
>> +       if (tp != NULL)
>> +               *tp = KSVAR(wall_clock)->tv;
>> +       if (tzp != NULL)
>> +               *tzp = KSVAR(wall_clock)->tz;
>> +       return (0);
>> +}
>>
>> Now when a process calls gettimeofday, it will actually call that function.
>> If the process makes a lot of calls to gettimeofday, we will see a
>> performance boost.
>> Note that if ksvar are not exported from the kernel (KSVAR_IS_ACTIVE),
>> the function falls back to calling the actual syscall (sys_gettimeofday).
>>
>> Open tasks
>> - implement support for 32-bit emulated processes running in a 64-bit
>> environment.
>> - extend support to other archs
>> - implement more syscalls
>> - benchmarks
>> - Test, test, test.
>>
>> I'm looking forward to hearing your comments and suggestions.
>
> I very much dislike what you described, it makes ABI maintenance
> a nightmare.
> Below is some mail I wrote around Spring 2009, making some notes about
> desired proposal. This is what is called vdso in Linux land.

Did you bother to read at least Giovanni's description?
Because this has nothing to do with VDSO in Linux.

I think he just wants to map into userland processes some pages from
the static image of the kernel (packed together in a specific
dataset). This poses some non-trivial problems. The first is that the
static image is not meant to have physical pages tied to it. The
second is that he needs a clean design that lets consumers of this
mechanism correctly locate the information they want within the shared
page(s) and, in the end, read the correct values.
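
To make the "locate" part concrete, here is a rough user-space sketch of
the layout Giovanni describes: a table of 32-bit offsets at the start of
the page, followed by the variables. Everything in it is a stand-in for
illustration only. fake_page simulates the mapped page,
fake_kernel_publish() simulates what the kernel side would do, and
ksvar_lookup() is a hypothetical name, not something from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KSVAR_TABLE_SIZE 2	/* hypothetical: two table slots */
#define PAGE_BYTES 4096

/* Stand-in for the page the kernel would map read-only. */
static unsigned char fake_page[PAGE_BYTES];

/* Simulated kernel side: store a value and record its offset in slot idx. */
static void
fake_kernel_publish(uint32_t idx, uint32_t off, const void *val, size_t len)
{
	assert(idx < KSVAR_TABLE_SIZE);
	assert(off >= KSVAR_TABLE_SIZE * sizeof(uint32_t));
	assert(off + len <= PAGE_BYTES);
	memcpy(fake_page + idx * sizeof(off), &off, sizeof(off));
	memcpy(fake_page + off, val, len);
}

/* User side: turn an index into a pointer inside the mapped page. */
static const unsigned char *
ksvar_lookup(const unsigned char *base, uint32_t idx)
{
	uint32_t off;

	if (idx >= KSVAR_TABLE_SIZE)
		return (NULL);
	memcpy(&off, base + idx * sizeof(off), sizeof(off));
	return (base + off);
}

/* Publish an int at index 0, then read it back through the table. */
static int
ksvar_demo(void)
{
	const int published = 42;
	const unsigned char *p;
	int seen;

	fake_kernel_publish(0, 64, &published, sizeof(published));
	p = ksvar_lookup(fake_page, 0);
	memcpy(&seen, p, sizeof(seen));
	return (seen);
}
```

Note that the reader only works if it agrees with the writer on what
index 0 means, which is exactly the coupling I complain about below.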

I have some reservations about both the implementation and the approach
for retrieving data from the page.
In particular, I don't like that a new vm_object is allocated for this
page. What I would really like is one of:
1) very minimal implementation -- you just use
pmap_enter()/pmap_remove() specifically when needed, separately, in
the fork(), execve(), etc. cases
2) more complete approach -- you make a very thin layer which lets you
map pages from the static image of the kernel, and the shared page
becomes just a specific consumer of this. This way the object makes
much more sense because it becomes an object associated with the whole
static image of the kernel

About the layering, I don't like that you require both a kernel and
userland header to locate the objects within the page. This is very
likely prone to ABI breakage. What is needed is a mechanism for
retrieving at run time what Giovanni calls "indexes", or making it
index-agnostic.
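
One possible shape for such a run-time mechanism, purely as a sketch:
the page starts with a small directory keyed by name, and userland
resolves a name to an offset once at startup instead of compiling in an
index. None of this is in the patch. struct ksvar_dirent, struct
ksvar_dir, ksvar_find() and the simulated page below are all
hypothetical names made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical directory entry describing one shared variable. */
struct ksvar_dirent {
	char     name[24];	/* NUL-terminated variable name */
	uint32_t offset;	/* byte offset from the start of the page */
	uint32_t size;		/* size of the variable in bytes */
};

/* Hypothetical directory placed at the start of the shared page. */
struct ksvar_dir {
	uint32_t count;
	struct ksvar_dirent ent[4];
};

/* Simulated shared page: the directory followed by one variable. */
struct fake_page {
	struct ksvar_dir dir;
	int test_value;
};

static const struct fake_page page = {
	.dir = {
		.count = 1,
		.ent = { {
			.name = "test_value",
			.offset = offsetof(struct fake_page, test_value),
			.size = sizeof(int),
		} },
	},
	.test_value = 7,
};

/*
 * Resolve a variable by name. Returns NULL when the name is unknown or
 * when the size does not match what the caller was built against.
 */
static const void *
ksvar_find(const void *base, const char *name, size_t want_size)
{
	const struct ksvar_dir *dir = base;
	uint32_t i;

	for (i = 0; i < dir->count; i++) {
		const struct ksvar_dirent *e = &dir->ent[i];

		if (strcmp(e->name, name) == 0)
			return (e->size == want_size ?
			    (const unsigned char *)base + e->offset : NULL);
	}
	return (NULL);
}
```

A consumer would call ksvar_find(base, "test_value", sizeof(int)) once,
cache the pointer, and fall back to the syscall on NULL. The size check
catches the case where kernel and libc disagree on the layout, at the
cost of a string compare at startup.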

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


