Date:      Sat, 2 Jun 2012 18:00:06 +0100
From:      Attilio Rao <attilio@freebsd.org>
To:        freebsd-arch@freebsd.org, Gianni <gianni@freebsd.org>,  Alexander Kabaev <kan@freebsd.org>, Alan Cox <alc@rice.edu>, Konstantin Belousov <kib@freebsd.org>
Subject:   Fwd: [RFC] Kernel shared variables
Message-ID:  <CAJ-FndCpztSWyJo2hRVs5qu%2BvQOj9E1mPBhfVOxM_OC2eNac6A@mail.gmail.com>
In-Reply-To: <CAJ-FndAXFwuEspq%2BQeF0Hv1dr8JjREP=c=g3-abP=eoZ-D4hEg@mail.gmail.com>
References:  <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2%2BoYo%2BwwT4ipA@mail.gmail.com> <20120601193522.GA2358@deviant.kiev.zoral.com.ua> <CAJ-FndC71=3Jo%2BBxQi==gCoLipBxj8X8XMBydjvrcKeGw%2BWOnA@mail.gmail.com> <20120602164847.GB2358@deviant.kiev.zoral.com.ua> <CAJ-FndAXFwuEspq%2BQeF0Hv1dr8JjREP=c=g3-abP=eoZ-D4hEg@mail.gmail.com>

Sorry, resending with all the recipients in.

Attilio


---------- Forwarded message ----------
From: Attilio Rao <attilio@freebsd.org>
Date: 2012/6/2
Subject: Re: [RFC] Kernel shared variables
To: Konstantin Belousov <kostikbel@gmail.com>


2012/6/2 Konstantin Belousov <kostikbel@gmail.com>:
> On Sat, Jun 02, 2012 at 02:01:35PM +0100, Attilio Rao wrote:
>> 2012/6/1 Konstantin Belousov <kostikbel@gmail.com>:
>> > On Fri, Jun 01, 2012 at 07:53:15PM +0200, Giovanni Trematerra wrote:
>> >> Hello,
>> >> I'd like to discuss a way to provide a mechanism to share some read-only
>> >> data between kernel and user space programs avoiding syscall overhead,
>> >> implementing some of them, such as gettimeofday(3) and time(3), as ordinary
>> >> user space routines.
>> >>
>> >> The patch at
>> >> http://www.trematerra.net/patches/ksvar_experimental.patch
>> >>
>> >> is in a very experimental stage. It's just a proof-of-concept.
>> >> Only works for an AMD64 kernel and only for 64-bit applications.
>> >> The idea is to put all the variables that we want to share between kernel
>> >> and user space into one or more consecutive pages of memory that will be
>> >> mapped read-only into every running process. At the start of the first
>> >> shared page
>> >> there'll be a table with as many entries as the number of the shared variables.
>> >> Each entry is a 32-bit value that is the offset between the start of the shared
>> >> page and the start of the variable in the page. The user space processes need
>> >> to find out the map address of the shared page and use the table to access the
>> >> shared variables.
>> >> Kernel will export a variable to user space as an index, so user space code
>> >> must refer to a specific index to access a kernel shared variable.
>> >> Let's take a quick look at the KPI/API for exporting/importing kernel
>> >> shared variables.
>> >> Say we want to implement a routine to export an int from the kernel.
>> >> To define the variable to be exported inside the kernel you would use
>> >>
>> >> KSVAR_DEFINE(0, int, test_value);
>> >>
>> >> You have just defined an int variable named "test_value" at index 0.
>> >> Inside the kernel you can write/read as usual using the symbol test_value.
>> >> Now you likely want to add to libc a function callable from user processes
>> >> that returns the test_value variable. So first of all you need to import the
>> >> variable.
>> >>
>> >> KSVAR_IMPORT(0, int, test_value);
>> >>
>> >> and to obtain a pointer to read the value you would use
>> >>
>> >> KSVAR(test_value);
>> >>
>> >> so your function would look something like this
>> >>
>> >> int get_test_value()
>> >> {
>> >>
>> >>      return (*KSVAR(test_value));
>> >> }
>> >>
>> >> Then inside your process just call the get_test_value() function as you usually
>> >> do and you'll get a kernel written value without switching to kernel mode.
>> >>
>> >> Let's see now in more detail how that could be accomplished.
>> >> The shared variables will be accessed as normal variables and are read/write
>> >> inside the kernel. The variables need to be inside the same page(s) and nothing
>> >> but the shared variables (and the table) must be in the page(s). To
>> >> obtain that
>> >> I changed the linker script in this way
>> >>
>> >> --- a/sys/conf/ldscript.amd64
>> >> +++ b/sys/conf/ldscript.amd64
>> >> @@ -177,6 +177,15 @@ SECTIONS
>> >>     *(.ldata .ldata.* .gnu.linkonce.l.*)
>> >>     . = ALIGN(. != 0 ? 64 / 8 : 1);
>> >>   }
>> >> +  .ksvar ALIGN(CONSTANT (COMMONPAGESIZE)) :
>> >> +  {
>> >> +    __ksvar_set_start = .;
>> >> +    *(.ksvar_table)
>> >> +    *(.ksvar)
>> >> +
>> >> +   . = ALIGN(CONSTANT (COMMONPAGESIZE));
>> >> +   __ksvar_set_stop = .;
>> >> +  }
>> >>   . = ALIGN(64 / 8);
>> >>   _end = .; PROVIDE (end = .);
>> >>   . = DATA_SEGMENT_END (.);
>> >>
>> >> When we want to define a variable in the kernel to share with user space
>> >> we have to use KSVAR_DEFINE macro in sys/sys/ksvar.h
>> >>
>> >> +struct ksvar_set {
>> >> +       uint32_t idx;
>> >> +       char *pksvar;
>> >> +};
>> >> +
>> >> +/*
>> >> + * Declare a variable into kernel shared linker_set.
>> >> + */
>> >> +#define        KSVAR_DEFINE(index, type, name) \
>> >> +       static type name __section(".ksvar");                   \
>> >> +       static struct ksvar_set name ## _ksvar_set = {          \
>> >> +               .idx = index,                                   \
>> >> +               .pksvar = (char *) &name                        \
>> >> +       };                                                      \
>> >> +       DATA_SET(ksvar_set, name ## _ksvar_set)
>> >>
>> >> Every variable must have a unique index. The indexes must
>> >> start from zero and be consecutive. When you add an index
>> >> you must bump the size of the table (KSVAR_TABLE_SIZE)
>> >> (see sys/sys/ksvar.h)
>> >>
>> >> The variables are inside the kernel static image that isn't managed
>> >> by the VM and so we need to allocate pages to map the physical addresses.
>> >> A new SYSINIT (ksvarinit) will allocate a set of vm_page_t through
>> >> the vm_phys_fictitious_reg_range interface and fill the table using
>> >> the information
>> >> of the ksvar_set linker set, then will create a vm_object_t (vm_object_ksvar),
>> >> mark the fake pages as valid and put them into it.
>> >> When a new process is created by exec(3) the vm_object_ksvar will be
>> >> mapped read-only into the process address space by the vm_map_fixed routine
>> >> just before mapping the user stack. The address of the mapping will be recorded
>> >> inside the new p_ksvar field of the struct proc.
>> >> This field will be exported through a sysctl to the user space processes.
>> >> In order to implement syscalls as user space routines, we have to find out the
>> >> mapped address of the kernel shared variables when the libc is mapped into
>> >> the process. So I added a function marked with the attribute constructor.
>> >> It will be called before any code in the user process and before any code inside
>> >> the libc.
>> >>
>> >> +__attribute((constructor)) void init_kernel_shared()
>> >> +{
>> >> +       int mib[2];
>> >> +       size_t len;
>> >> +       vm_offset_t ksvar_address;
>> >> +
>> >> +       mib[0] = CTL_KERN;
>> >> +       mib[1] = KERN_KSVAR;
>> >> +       len = sizeof(vm_offset_t);
>> >> +       if (__sysctl(mib, 2, (void *) &ksvar_address, &len, NULL, 0) != -1)
>> >> +               ksvar_table = (uint32_t *) ksvar_address;
>> >> +}
>> >>
>> >> Once the libc knows the address of the table it can access the shared
>> >> variables.
>> >>
>> >> Just as proof of concept I re-implemented gettimeofday(3) in user space.
>> >> First of all I didn't remove the entry into the syscall.master, just renamed the
>> >> sys_gettimeofday. I need it for the fallback path.
>> >> In the kernel I introduced a struct wall_clock.
>> >>
>> >> +struct wall_clock
>> >> +{
>> >> +       struct timeval  tv;
>> >> +       struct timezone tz;
>> >> +};
>> >>
>> >> The struct is exported through sys/sys/time.h header.
>> >> I defined a new kernel shared variable. To do so I added an index in
>> >> sys/sys/ksvar.h
>> >> WALL_CLOCK_INDEX and bumped KSVAR_TABLE_SIZE to 1.
>> >> In the sys/kern/kern_clocksource.c
>> >>
>> >> +/* kernel shared variable for implementing gettimeofday. */
>> >> +KSVAR_DEFINE(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);
>> >>
>> >> Now we defined a shared variable at index WALL_CLOCK_INDEX of type
>> >> struct wall_clock and named wall_clock.
>> >> Inside handleevents I update the info exported by wall_clock.
>> >>
>> >> +       struct timeval tv;
>> >> +
>> >> +       /* update time for userspace gettimeofday */
>> >> +       microtime(&tv);
>> >> +       wall_clock.tv = tv;
>> >> +       wall_clock.tz.tz_minuteswest = tz_minuteswest;
>> >> +       wall_clock.tz.tz_dsttime = tz_dsttime;
>> >>
>> >> Now, in libc we import the shared variable
>> >>
>> >> +KSVAR_IMPORT(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);
>> >>
>> >> note that WALL_CLOCK_INDEX must be the same as the one defined
>> >> inside the kernel, and define a new function gettimeofday
>> >>
>> >> +int
>> >> +gettimeofday(struct timeval *tp, struct timezone *tzp)
>> >> +{
>> >> +
>> >> +       /* fallback to syscall if kernel doesn't export ksvar */
>> >> +       if (!KSVAR_IS_ACTIVE())
>> >> +               return (sys_gettimeofday(tp, tzp));
>> >> +
>> >> +       if (tp != NULL)
>> >> +               *tp = KSVAR(wall_clock)->tv;
>> >> +       if (tzp != NULL)
>> >> +               *tzp = KSVAR(wall_clock)->tz;
>> >> +       return (0);
>> >> +}
>> >>
>> >> Now when a process calls gettimeofday, it will actually call that function.
>> >> If the process makes a lot of calls to gettimeofday, we will see a
>> >> performance boost.
>> >> Note that if ksvar are not exported from the kernel (KSVAR_IS_ACTIVE),
>> >> the function
>> >> falls back to calling the actual syscall (sys_gettimeofday).
>> >>
>> >> Open tasks
>> >> - implement support for 32-bit emulated processes running in a 64-bit
>> >> environment.
>> >> - extend support to other archs
>> >> - implement more syscalls
>> >> - benchmarks
>> >> - Test, test, test.
>> >>
>> >> I'm looking forward to hearing your comments and suggestions.
>> >
>> > I very much dislike what you described, it makes ABI maintenance
>> > a nightmare.
>> > Below is some mail I wrote around Spring 2009, making some notes about
>> > desired proposal. This is what is called vdso in Linux land.
>>
>> Did you bother to read at least Giovanni's description?
>> Because this has nothing to do with VDSO in Linux.
> Did you bother to think briefly about why I object?
>
>>
>> I think, he just wants to map in userland processes some pages from
>> the static image of the kernel (packed together in a specific
>> dataset). This imposes some non-trivial problems. The first is
>> that the static image is not thought to have physical pages tied to
>> it. The second is that he needs to make a clean design in order to let
>> consumers of this mechanism correctly locate the information they want
>> within the shared page(s) and in the end read the correct values.
> Right, exactly, and this is why I object to the "offsets" approach.
> It basically moves us to the old times of the "jump tables" shared
> libraries, that fortunately was never a case for FreeBSD even when
> a.out was used.

I'm objecting to this too.
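
To make the objection concrete: as I read the proposal, the libc-side
lookup boils down to something like the sketch below. Only the idea of a
table of 32-bit offsets at the start of the page comes from the proposal.
Every name here (fake_page, fake_kernel_publish, ksvar_lookup, read_int_at)
is made up for illustration and is not taken from the patch, and an
ordinary array stands in for the page the kernel would map read-only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KSVAR_TABLE_SIZE 2      /* hypothetical: number of exported variables */
#define PAGE_BYTES       4096   /* assumed page size for the sketch */

/* Stand-in for the shared page; in the proposal the kernel owns it. */
static unsigned char fake_page[PAGE_BYTES];

/* "Kernel side": store a value in the page and record its offset in slot idx. */
static void
fake_kernel_publish(uint32_t idx, uint32_t off, const void *val, size_t len)
{
	assert(idx < KSVAR_TABLE_SIZE);
	assert(off >= KSVAR_TABLE_SIZE * sizeof(uint32_t));
	assert(off + len <= PAGE_BYTES);
	memcpy(fake_page + idx * sizeof(off), &off, sizeof(off));
	memcpy(fake_page + off, val, len);
}

/* "libc side": turn a compiled-in index into a pointer inside the page. */
static const void *
ksvar_lookup(const unsigned char *page, uint32_t idx)
{
	uint32_t off;

	memcpy(&off, page + idx * sizeof(off), sizeof(off));
	return (page + off);
}

/* Read an int exported at the given index. */
static int
read_int_at(const unsigned char *page, uint32_t idx)
{
	int v;

	memcpy(&v, ksvar_lookup(page, idx), sizeof(v));
	return (v);
}
```

The index is baked into both the kernel and libc at build time, so if a
later kernel renumbers or drops an entry, an older libc silently reads
whatever now lives in that slot. That is the ABI fragility in a nutshell.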

>>
>> I have some reservations on both the implementation and the approach
>> for retrieving data from the page.
>> In particular, I don't like that a new vm_object is allocated for this
>> page. What I really would like would be:
>> 1) very minimal implementation -- you just use
>> pmap_enter()/pmap_remove() specifically when needed, separately, in
>> fork(), execve(), etc. cases
> Oh, this simply cannot work.

And why? Assuming you provide a vm_page_t from a UMA zone just like
fakepage does. Of course you cannot recycle for this purpose any page
coming from vm_page_alloc().

>> 2) more complete approach -- you make a very quick layer which lets you
>> map pages from the static image of the kernel and the shared page
>> becomes just a specific consumer of this. This way the object makes much
>> more sense because it becomes an object associated to all the static
>> image of the kernel
> So you want to circumvent the vm layer.

Not sure I agree with your opinion on this.

>>
>> About the layering, I don't like that you require both a kernel and
>> userland header to locate the objects within the page. This is very
>> likely ABI breakage prone. A mechanism is needed for retrieving at
>> run time what Giovanni calls "indexes", or making it indexes-agnostic.
>
> And this is what VDSO is for. VDSO with the standard ELF symbol
> interposition rules allow to have libc that is completely unaware of the
> shared page and 'indexes', i.e. which works both for older kernel that
> do not export required index, and for new kernels that export the same
> information in some more advanced format. By having VDSO that exports
> e.g. gettimeofday() we would get override for libc gettimeofday, while
> having fully functional libc for other, future and past, kernels, even
> if the format of the data exported for super-fast gettimeofday changes.
>
> The tie between VDSO and kernel is not a problem, since VDSO is part
> of the kernel from the deployment POV. Moreover, either the existing ELF
> linker in kernel, or some trivial modifications to it, would allow
> to not use 'indexes' on the kernel side too.

I admit I don't have a better plan on how to retrieve objects from the
shared page at the moment, I didn't give much thought to it.
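
Just to illustrate what "retrieving the indexes at run time" could mean,
and not as a worked-out plan: one option is a small directory in the
shared page keyed by name, so the consumer looks a variable up by string
and falls back to the syscall when the name (or its expected size) is
not there. The structures and functions below (ksv_dirent, ksv_dir,
ksv_add, ksv_find) are hypothetical and exist only in this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KSV_NAMELEN 24  /* hypothetical fixed name size, including the NUL */
#define KSV_MAXENT  8   /* hypothetical directory capacity */

/* One directory entry: a name plus where the data lives in the page. */
struct ksv_dirent {
	char     name[KSV_NAMELEN];
	uint32_t off;   /* offset of the variable from the page start */
	uint32_t len;   /* size of the variable, checked by consumers */
};

/* Directory that would sit at the start of the shared page. */
struct ksv_dir {
	uint32_t          count;
	struct ksv_dirent ent[KSV_MAXENT];
};

/* "Kernel side": append an entry to the directory. */
static void
ksv_add(struct ksv_dir *dir, const char *name, uint32_t off, uint32_t len)
{
	struct ksv_dirent *e;

	assert(dir->count < KSV_MAXENT && strlen(name) < KSV_NAMELEN);
	e = &dir->ent[dir->count++];
	memset(e, 0, sizeof(*e));
	strncpy(e->name, name, KSV_NAMELEN - 1);
	e->off = off;
	e->len = len;
}

/*
 * "libc side": find a variable by name. Returns NULL if the kernel does
 * not export it or exports it with a different size, in which case the
 * caller would use the ordinary syscall.
 */
static const struct ksv_dirent *
ksv_find(const struct ksv_dir *dir, const char *name, uint32_t want_len)
{
	for (uint32_t i = 0; i < dir->count; i++) {
		const struct ksv_dirent *e = &dir->ent[i];

		if (strncmp(e->name, name, KSV_NAMELEN) == 0 &&
		    e->len == want_len)
			return (e);
	}
	return (NULL);
}
```

This only removes the compile-time index. The layout of each exported
structure is still shared knowledge between kernel and userland, which
is the part the VDSO approach hides.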

> We already have a shared page between kernel and whole set of the same-ABI
> processes. Currently it is used for signal trampolines only.
> The hard part of the task is to provide VDSO build glue. Also IMO the
> hard task is to define sensible gettimeofday() implementation, probably
> using rdtsc in usermode. Shared page is easy, or at least it is already
> there without ugly and non-working vm hacks.
>
> As an additional note, already put by Bruce, the implementation of
> usermode gettimeofday is exactly opposite of any reasonable implementation.
> It loses the precision to the frequency of the event timer. Obvious
> approach is to not have any periodically updating data for gettimeofday
> purpose, and use some formula with rdtsc and kernel-provided coefficients
> on the machines where rdtsc is usable.

The gettimeofday() implementation is a different story than what is asked here.
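
For reference only, the kind of formula being described (a counter read
plus kernel-provided coefficients) could be sketched as below. The
struct timeconv layout, its field names and tsc_to_ns() are assumptions
made for this example, not an existing interface. The counter value is
passed in as a parameter instead of being read with rdtsc, so the
sketch stays portable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical coefficients the kernel would publish in the shared page. */
struct timeconv {
	uint64_t base_tsc;  /* counter value at the last kernel update */
	uint64_t base_ns;   /* wall-clock nanoseconds at base_tsc */
	uint32_t mult;      /* nanoseconds per tick, scaled by 2^shift */
	uint32_t shift;     /* scaling exponent for mult */
};

/*
 * Convert a counter reading to nanoseconds:
 *   ns = base_ns + ((tsc - base_tsc) * mult) >> shift
 * Sketch only: the product must fit in 64 bits, so the kernel would have
 * to refresh base_tsc/base_ns often enough to keep the delta small.
 */
static uint64_t
tsc_to_ns(const struct timeconv *tc, uint64_t tsc)
{
	uint64_t delta = tsc - tc->base_tsc;

	assert(tc->mult == 0 || delta <= UINT64_MAX / tc->mult);
	return (tc->base_ns + ((delta * tc->mult) >> tc->shift));
}
```

With a 2 GHz counter (0.5 ns per tick) one could pick mult = 512 and
shift = 10. A real implementation would also need a way to read the
coefficients consistently while the kernel updates them, which this
sketch leaves out.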

> Interesting question is how much shared the shared page needs be.
> Obvious needs are shared between all same-ABI processes, but I can also
> easily see a need for the per-process private information be present in
> the 'private-shared' page. For silly but typical example, useful for
> moronix-style benchmarks, see getpid().

Really the performance benefit of having a fast getpid() is marginal if
compared to heavily used things like gettimeofday(). I cannot think
of a per-process page implementing a fast syscall that can bring many
performance advantages.

Attilio


--
Peace can only be achieved by understanding - A. Einstein




