From owner-freebsd-net@FreeBSD.ORG Mon Dec 8 08:47:54 2008
Message-ID: <493CDF3C.5030608@elischer.org>
Date: Mon, 08 Dec 2008 00:47:56 -0800
From: Julian Elischer
User-Agent: Thunderbird 2.0.0.18 (Macintosh/20081105)
MIME-Version: 1.0
To: FreeBSD Net
Content-Type: multipart/mixed; boundary="------------030705070707060304090701"
Subject: Vimage howto

This is a multi-part message in MIME format.
--------------030705070707060304090701
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Well not completely, but I've had a number of questions over the last
few months about what it is, so, as Marko and I have written the
following "how to virtualize your module" document, I've been directing
people to it. After another couple of questions I think this could do
with wider distribution.
It is available at:
http://perforce.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/vimage/porting_to_vimage.txt
but I include it here for popular enjoyment.

Please contact me or Marko if you have any questions or suggestions
on this.

--------------030705070707060304090701
Content-Type: text/plain;
 name="porting_to_vimage.txt"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="porting_to_vimage.txt"

===================
Vimage: what is it?
===================

Vimage is a framework in the BSD kernel which allows a co-operating
module to operate on multiple independent instances of its state so
that it can participate in a virtual machine / virtual environment
scenario.

The implementation approach taken by the vimage framework is a
replacement of selected global state variables with constructs that
allow for the virtualized state to be stored and resolved in
appropriate instances of module-specific container structures. The
code operating on virtualized state has to conform to a set of rules
described further below, among other things in order to allow for all
the changes to be conditionally compilable, i.e. permitting the
virtualized code to fall back to operation on global state.

The most visible change throughout the existing code is typically
replacement of direct references to global variables with macros;
foo_bar thus becomes V_foo_bar. V_foo_bar macros will resolve back to
the foo_bar global in default kernel builds, and alternatively to
some_base_pointer->_foo_bar for "options VIMAGE" kernel configs.
Prepending "V_" prefixes to variable references helps in visual
discrimination between global and virtualized state.

The framework extends the sysctl infrastructure to support access to
virtualized state through introduction of the SYSCTL_V family of
macros; those also automatically fall back to their standard SYSCTL
counterparts in default kernel builds.

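The V_ macro convention described above can be tried out in userland.
What follows is only a sketch under stated assumptions: DEMO_VIMAGE,
DEMO_INIT_FOO(), struct demo_vnet_foo and foo_bump() are made-up names
for illustration and are not part of the vimage framework. It shows one
function body that compiles either against a per-instance structure or
against a plain global, depending on a single switch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical switch standing in for "options VIMAGE". */
#define DEMO_VIMAGE 1

/* Container for the virtualized state of a made-up module "foo". */
struct demo_vnet_foo {
	int _foo_bar;
};

#ifdef DEMO_VIMAGE
/* Virtualized build: resolve the symbol through a local base pointer. */
#define DEMO_INIT_FOO(base)	struct demo_vnet_foo *vnet_foo = (base)
#define V_foo_bar		(vnet_foo->_foo_bar)
#else
/* Default build: fall back to the plain global variable. */
static int foo_bar;
#define DEMO_INIT_FOO(base)	(void)(base)
#define V_foo_bar		foo_bar
#endif

/* The function body is identical in both builds; only the macros differ. */
static int
foo_bump(struct demo_vnet_foo *base)
{
	DEMO_INIT_FOO(base);

	assert(base != NULL);
	return (++V_foo_bar);
}
```

With DEMO_VIMAGE defined, calling foo_bump() on two different
struct demo_vnet_foo instances advances two independent counters; with
the switch removed, every call would advance the single global.
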
Transparent kldsym(2) lookups are provided to virtualized variables
explicitly marked for visibility to the kldsym interface, which permits
userland binaries such as netstat to operate unmodified on
"options VIMAGE" kernels, though this may have wide security
implications.

The vimage struct is currently primarily a placeholder for pointers to
module-specific struct instances; currently the V_NET (networking),
V_CPU (CPU scheduling), and V_PROCG (jail-style interprocess
protection) major module classes are defined. Each vimage module may
or may not be further split into minor modules or submodules; the
networking subsystem (vimage id V_NET; struct vnet) in particular is
organized in submodules such as VNET_MOD_NET (mandatory shared
infrastructure: routing tables, interface lists etc.); VNET_MOD_INET
(IPv4 state including transport protocols); VNET_MOD_INET6,
VNET_MOD_IPSEC, VNET_MOD_IPFW, VNET_MOD_NETGRAPH etc.

The speciality of VNET submodules is that they not only provide
storage for virtualized data, but also enforce ordering of
initialization and cleanup. Hence, not all submodules must necessarily
allocate private storage for their specific data; they may be defined
solely to support proper initialization ordering.

Each process is associated with a vimage, and vimages currently hang
off of ucred-s. This relationship defines a process's administrative
affinity to a vimage and thus indirectly to all of its modules (NET,
CPU, PROCG) as well as to any submodules.

All network interfaces and sockets hold pointers back to their parent
vnets; this relationship is entirely independent from
proc->ucred->vimage bindings. Hence, when a process opens a socket,
the socket will get bound to a vnet instance hanging off of
proc->ucred->vimage->vnet, but once such a socket->vnet binding gets
established, it cannot be changed for the entire socket lifetime.
Certain classes of network interfaces (Ethernet in particular) can be
assigned from one vnet to another at any time.

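The socket binding rule above can be modelled in a few lines. This is
a userland sketch only: the demo_* structures and demo_socreate() are
hypothetical stand-ins that keep just the pointer chain named in the
text (proc->ucred->vimage->vnet and so->so_vnet); the real kernel
structures carry far more state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct demo_vnet   { int id; };
struct demo_vimage { struct demo_vnet *vnet; };
struct demo_ucred  { struct demo_vimage *vimage; };
struct demo_proc   { struct demo_ucred *ucred; };
struct demo_socket { struct demo_vnet *so_vnet; };

/*
 * Bind a new socket to the vnet reachable from the creating process.
 * The pointer is copied once; nothing later rewrites so_vnet, so the
 * binding survives any change of the process's credentials.
 */
static void
demo_socreate(struct demo_socket *so, const struct demo_proc *p)
{
	assert(so != NULL && p != NULL);
	assert(p->ucred != NULL && p->ucred->vimage != NULL);
	so->so_vnet = p->ucred->vimage->vnet;
}
```

Code that later operates on an existing socket would read so->so_vnet
and would not walk the process's credential chain again, since the
process may have moved to a different vimage in the meantime.
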
By definition all vnets are independent and can communicate only if
they are explicitly provided with communication paths; currently only
netgraph can be used to establish inter-vnet datapaths.

In network traffic processing the vnet affinity is defined either by
the inbound interface or by the socket / pcb -> vnet binding. However,
there are many functions in the network stack that cannot implicitly
fetch the vnet context from their standard arguments. Instead of
explicitly extending argument lists of such functions with a
struct vnet *, a per-thread variable td_vnet was introduced, which can
be fetched via the curvnet macro
(#define curvnet curthread->td_vnet). The curvnet context has to be
set on entry to the network stack (socket operations, packet
reception, or timer-driven functions) and cleared on exit. This must
be done via the provided CURVNET_SET() / CURVNET_RESTORE() family of
macros, which allow for "stacking" of curvnet context setting and
provide additional debugging info in INVARIANTS kernel configs.

In most cases, however, a developer writing virtualized code will not
have to set / restore the curvnet context unless the code includes
timer-driven events, given that those are inherently vnet-contextless
on entry.

Converting / virtualizing existing code
=======================================

There are several steps needed in virtualisation.

1/ Decide whether the module needs to be virtualised. If the module is
    a driver for specific hardware, it makes sense that there be only
    one instance of the driver as there is only one piece of physical
    hardware. There are changes in the networking code to allow
    physical (or virtual) interfaces to be moved between vnets. This
    generally requires NO changes to the network drivers of the
    classes covered (e.g. ethernet).

2/ Decide if your module is part of one of the major module groups.
    These are currently V_NET, V_PROCG and V_CPU. The reader will note
    that the descriptions below use the acronym VNET a lot.
    The vimage system has at this time been broken into a number of
    subsections. One of these is the "VNET" group. The idea of these
    subsections is that they might be individually selected as
    virtualizable in a particular virtual machine instance. As an
    example, in a virtualization, one might want to allocate a couple
    of processors to it, but keep the same filesystem and network
    setup, or alternatively to share processors but to have
    virtualised networking.

3/ If the module is to be virtualised, decide which attributes of the
    module should be virtualised. For example, it may make sense that
    there be a single central pool of "struct foo" and a single uma
    zone for them to come from, with a single lock guarding it. It
    might also make sense if the "foo_debug" sysctl controls all the
    instances at once, while on the other hand, the "foo_mode" sysctl
    might make better sense if it were controllable on a virtual
    system by virtual system basis.

4/ Work out what global variables and structures are to be virtualised
    to achieve the behaviour required for part #3.

5/ Work out, for all the code paths through the module, how the path
    entering the module can divine which virtual environment it is on.
    Some examples:

    * Since interfaces are all assigned to one vnet or another, an
      incoming packet has a pointer to the receive interface, which in
      turn has a pointer back to the vnet. Often "curvnet" will
      already have been set by the time your code is called anyhow.
    * Similarly, on any request from outside the kernel (direct or
      indirect), the current thread has a way to get to the current
      virtual environment instance via td->ucred->vimage. For
      existing sockets the vnet context must be taken from
      so->so_vnet, since td->ucred->vimage might change after socket
      creation.
    * Timer initiated actions usually have a (void *) argument which
      points to some private structure for the module. It should be
      possible to add a pointer to the appropriate module instance
      into whatever structure that points to.
    * Sometimes an action (timer triggered, or triggered by module
      load or unload) simply has to check all the vimage or module
      instances. There are macros (in pairs) for this which will
      iterate through all the VNET or VPROCG instances.

    This covers most of the cases, however in some cases it may still
    be required for the module to stash away the virtual environment
    instance somewhere, and make associated changes in the code.

6/ Add the code described below to the files that make up the module.

Details:

Temporary note: for module FOO add a definition for VNET_MOD_FOO in
sys/vimage.h. This will eventually be dynamically assigned. For now
these instructions refer mainly to VNET and not VCPU, VPROCG etc.

Symbols defined in other modules that have been virtualised will have
been moved to a module-specific virtualisation structure. It will be
defined in a .h file for just this purpose. If a module will never
export virtualised symbols beyond its borders, then this structure
may well just be in a common include file for that module. As an
example, common networking (but not protocol) variables have been
moved to a file called net/vnet.h, but the gre module has simply added
the virtualisation structure to if_gre.h as no code outside the gre
interface will access those values.

Accesses to virtualised symbols are achieved via macros, which
generally have the same name as the original symbol but with a "V_"
prepended; thus the head of the interface list, called 'ifnet', is
replaced wherever used with "V_ifnet". In SCTP, because the code is
shared with other OS's, they are replaced with a macro
MODULE_GLOBAL(modulename, symbol).

In the current version of vimage, when VIMAGE is not compiled into the
kernel, the macros evaluate to a direct reference to the symbol. In
future versions they will evaluate to a global version of the
virtualisation structure with the offset to the entry in question,
which will result in a single direct memory reference, so that the
speed will be as it is now.

When VIMAGE is compiled in, the macro will evaluate to an access to an
element in a structure pointed to by a local variable. For this
reason, it is necessary to also add, at the beginning of these
functions, another macro that will instantiate this local variable and
point it at the correct place.

As an example, prior to using the "V_ifnet" structure in a program
block, we must add the following macro at the head of a code block
enclosing the references, to set up the module-specific base pointer
variable:

    INIT_VNET_NET(initial_value); /* initial value is usually curvnet */

When VIMAGE is not defined, this will evaluate to nothing, but when it
IS defined, it will evaluate to:

    struct vnet_net *vnet_net = (initial_value);

The initial value is usually something like "curvnet", which in turn
is a macro that derives the vnet affinity from the current thread. It
could also be (m->m_ifp->if_vnet) if we were receiving an mbuf.

In the case where it is just one function in a module calling another
(static) function, the porter might decide to simply pass the local
variable as an argument, rather than to reevaluate it in the called
function, but should be prepared to cope with the fact that the code
might be compiled in the "no-VIMAGE" manner (in which case the
argument would be marked as "unused").

Usually, when a packet enters the system it is carried through the
processing path by a single thread, and that thread will set its
virtual environment reference to that indicated by the packet on
picking up that new packet. This means that in the normal inbound
processing path as well as the outgoing processing path the current
thread can be used to indicate the current virtual environment.

In the case of timer initiated events, best practice would also be to
set the current virtual module reference, calculated in whatever way
is appropriate, so that any functions called can rely on the current
thread being a good reference for the correct virtual module.

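The timer case can be sketched as follows, under loudly stated
assumptions. A single global (demo_curvnet) stands in for the
per-thread td_vnet, and the DEMO_CURVNET_SET() / DEMO_CURVNET_RESTORE()
macros only imitate the save-and-restore "stacking" behaviour
described for CURVNET_SET() / CURVNET_RESTORE(); they are not the real
definitions. struct demo_foo_timer and its ft_vnet member are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct demo_vnet { int ticks; };

/* Stand-in for curthread->td_vnet; a real kernel keeps this per thread. */
static struct demo_vnet *demo_curvnet;

/* Save the previous context in a local, then switch; restore on exit. */
#define DEMO_CURVNET_SET(v)					\
	struct demo_vnet *demo_saved_vnet = demo_curvnet;	\
	demo_curvnet = (v)
#define DEMO_CURVNET_RESTORE()					\
	demo_curvnet = demo_saved_vnet

/* Private timer argument: the module stashes its vnet here at setup. */
struct demo_foo_timer {
	struct demo_vnet *ft_vnet;
};

/* Callee that takes no vnet argument and relies on the current context. */
static void
demo_foo_count(void)
{
	assert(demo_curvnet != NULL);
	demo_curvnet->ticks++;
}

/* Timer handler: contextless on entry, so set the context first. */
static void
demo_foo_timeout(void *arg)
{
	struct demo_foo_timer *ft = arg;

	DEMO_CURVNET_SET(ft->ft_vnet);
	demo_foo_count();
	DEMO_CURVNET_RESTORE();
}
```

Because the previous value is kept in a local variable, the handler
leaves whatever context its caller had in place, which is what allows
nested ("stacked") set / restore pairs.
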
When a new VNET submodule is defined for virtualisation, the following
structure defining macro is used to define it to the framework.

#define VNET_MOD_DECLARE(m_name_uc, m_name_lc, m_iattach, m_idetach, \
    m_dependson, m_symmap) \
	static const struct vnet_modinfo vnet_##m_name_lc##_modinfo = { \
		.vmi_id = VNET_MOD_##m_name_uc, \
		.vmi_dependson = VNET_MOD_##m_dependson, \
		.vmi_name = #m_name_lc, \
		.vmi_iattach = m_iattach, \
		.vmi_idetach = m_idetach, \
		.vmi_struct_size = \
		    sizeof(struct vnet_##m_name_lc), \
		.vmi_symmap = m_symmap \
};

The ID we allocated in the temporary first step in "Details" is the
first entry here; eventually this should be automatically done by
module name. The DEPENDSON field tells us the order in which modules
should be initialised in a new virtual environment. This may later
need to be changed to a list of text module names for dynamic
calculation. The rest of the fields are self explanatory, with the
exception of the symmap entry. The symmap allows us to intercept calls
by libkvm to the linker when it is looking up symbols and to redirect
them dynamically. This allows, for example, "netstat -r" to find the
routing tables for THIS virtual environment. (Of course that won't
work for core dumps.)
(XXX *needs thought *)

As an example of virtualising a dummy module named the FOO module, the
following code might be added to a special vfoo.h or at least to the
existing foo.h file:

========================================================
#ifndef _DIR_VFOO_H_
#define _DIR_VFOO_H_

#include /* for struct foo_bar */

#define INIT_VNET_FOO(vnet) \
	INIT_FROM_VNET(vnet, VNET_MOD_FOO, \
	    struct vnet_foo, vnet_foo)

#define VNET_FOO(sym)	VSYM(vnet_foo, sym)

#if (defined(VIMAGE) || defined(FUTURE))
struct vnet_foo {
	int		_foo_counter;
	struct foo_bar	_foo_barx;
};
#endif

/* Symbol translation macros */
#define V_foo_counter	VNET_FOO(foo_counter)
#define V_foo_barx	VNET_FOO(foo_barx)

#endif /* !_DIR_VFOO_H_ */
=========================================================

Each time the foo module is instantiated for a new virtual
environment, the foo_bar structure must be initialised, so new foo
constructor and destructor functions are defined for the module. The
module will call these when a new virtual environment is created or
destroyed. The constructor must be called once for the base machine
when the system is booted, even when options VIMAGE is not defined.

==================== in module foo.c ======
#include "opt_vimage.h"
[...]
#include [...]
#include [...]

#ifndef VIMAGE
/* initially the globals would have been here,
 * and for now we will leave them here when not using VIMAGE.
 * In the future we will instead have a static version of the structure.
 */
# if defined(FUTURE)
struct vnet_foo vnet_foo_globals;
# else /* !FUTURE */
int foo_counter = 0;
struct foo_bar foo_barx = {};
# endif /* !FUTURE */
#endif /* !VIMAGE */
[...]

#if (defined(VIMAGE) || defined(FUTURE))
static vnet_attach_fn vnet_foo_iattach;
static vnet_detach_fn vnet_foo_idetach;
#endif

#ifdef VIMAGE
/* If we have symbols we need to divert for libkvm
 * then put them in here. We may not need to do anything if
 * the symbols are not used by libkvm.
 */
static struct vnet_symmap vnet_foo_symmap[] = {
	VNET_SYMMAP(foo, foo_counter),
	VNET_SYMMAP(foo, foo_barx),
	VNET_SYMMAP_END
};

/*
 * Declare our module and state that we want to be done after the
 * loopback interface is initialised for the virtual environment.
 */
VNET_MOD_DECLARE(FOO, foo, vnet_foo_iattach, vnet_foo_idetach,
    LOIF, vnet_foo_symmap)
#endif /* VIMAGE */
[...]

/* a pre-existing 'foo' function that will be converted. */
void
foo_work(void)
{
	INIT_VNET_FOO(curvnet);	/* Add this at the front */

	V_foo_counter++;	/* add "V_" to the front of the symbol */
	[...]
	V_foo_barx.mumble = V_foo_counter;	/* and here too */
	[...]
}

/*
 * A function which on entry has no idea of which vnet it is on
 * and needs to look at them all for some reason.
 * NOTE! if this code is running in a thread that
 * does nothing else, or otherwise doesn't care about which
 * vnet it is on then the steps that save and restore the previous vnet
 * need not be done. (Marked with XXX)
 */
void
foo_tick(void)
{
	VNET_ITERATOR_DECL(vnet_iter);
	[...]
	[...]
	VNET_LIST_RLOCK();
	VNET_LIST_FOREACH(vnet_iter) {
		CURVNET_SET(vnet_iter);
		INIT_VNET_NET(vnet_iter);
		[...]
		/*
		 * do work, including calling code that assumes
		 * we have curvnet set.
		 */
		[...]
		CURVNET_RESTORE();
	}
	VNET_LIST_RUNLOCK();
	[...]
}

#if (defined(VIMAGE) || defined(FUTURE))
static int
vnet_foo_iattach(const void *unused)
{
	INIT_VNET_FOO(curvnet);

	V_foo_counter = 0;
	bzero(&V_foo_barx, sizeof(V_foo_barx));
	return 0;
}
#endif

#ifdef VIMAGE
static int
vnet_foo_idetach(const void *unused)
{
	INIT_VNET_FOO(curvnet);

	/* prove we are ready to remove the module */
	/* code here to do work required */
	return 0;
}
#endif /* VIMAGE */

/*
 * Handle loading and unloading for this code.
 * The only thing we need to link into is the NETISR structure.
 */
static int
foo_mod_event(module_t mod, int event, void *data)
{
	int error = 0;

	switch (event) {
	case MOD_LOAD:
		/* Initialize everything.
		 */
		/* put your code here */
#ifdef VIMAGE
		/* This will do the work for each virtual environment. */
		vnet_mod_register(&vnet_foo_modinfo);
#else /* !VIMAGE */
#ifdef FUTURE
		/* otherwise do the initialisation directly */
		vnet_foo_iattach(NULL);
#else /* !FUTURE */
		/* otherwise the initialisation is done statically */
#endif /* !FUTURE */
#endif /* !VIMAGE */
		break;

	case MOD_UNLOAD:
		/* You can't unload it because an interface may be using it. */
		/* this needs work */
		/* Should refuse to unload if any virtual environments */
		/* are using this still. */
		/* MARKO, fill in here */
		error = EBUSY;
		break;

	default:
		error = EOPNOTSUPP;
		break;
	}
	return (error);
}

--------------030705070707060304090701--