From: Marko Zec <zec@tel.fer.hr>
Date: Thu, 24 Oct 2002 02:30:41 +0200
To: Julian Elischer
Cc: "J. 'LoneWolf' Mattsson", freebsd-net@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: RFC: BSD network stack virtualization

Julian Elischer wrote:

> I'm very impressed. I do however have some questions.
> (I have not read the code yet, just the writeup)
>
> 1/ How do you cope with each machine expecting to have its own loopback
> interface? Is it sufficient to make lo1, lo2, lo3 etc. and attach them
> to the appropriate VMs?

The "lo0" interface is created automatically at the time each new
virtual image instance is created. The BSD networking code generally
assumes that "lo0" exists at all times, so it simply wouldn't be
possible to create a network stack instance without its own unique
"lo" interface.

> 2/ How much would be gained (i.e. is it worth it) to combine this with
> jail? Can you combine them? (does it work?) Does it make sense?

Userland separation (hiding) between processes running in different
virtual images is actually accomplished by reusing the jail framework.
The networking work is completely free of jail legacy. Although I
haven't tested it, I'm confident it should be possible to run multiple
jails inside each virtual image. To me that doesn't make much sense,
though, which is why I didn't bother testing it...

> 3/ You implemented this in 4.x which means that we need to reimplement
> it in -current before it has any chance of being 'included'. Do you
> think that would be a big problem?

I must admit that I do not follow development in -current, so it's hard
to tell how much the network stacks have diverged in the areas affected
by the virtualization work. My initial intention was to polish the
virtualization code on the 4.x platform - there are still some major
chunks of coding yet to be done, such as removal of virtual images and
patching of the IPv6 and IPSEC code. Hopefully this will be in sync
with the release of 5.0, so that I can then spend some time porting it
to -current. However, if there is reasonable demand, I'm prepared to
revise that approach...

> 5/ Does inclusion of the virtualisation have any measurable effect on
> throughput for systems that are NOT using virtualisation? In other
> words, does the non-virtualised code path get much extra work? (do you
> have numbers?) (i.e. does it cost much for the OTHER users if we
> incorporated this into FreeBSD?)

The first thing in my pipeline is doing decent performance/throughput
measurements, but these days I just cannot find enough spare time to do
that properly (I still have a daytime job...). Preliminary netperf
tests show around a 1-2% drop in maximum TCP throughput on loopback
with a 1500-byte MTU, so in my opinion this is a negligible penalty.
Of course, applications limited by media speed won't experience any
throughput degradation, except perhaps a barely measurable increase in
CPU time spent in interrupt context.
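If anyone wants to repeat that kind of measurement before I publish
proper numbers, something roughly along these lines should do; this is
only a sketch, and the MTU value, test length and address are arbitrary
examples rather than my exact setup:

# shrink the loopback MTU to match a typical Ethernet segment
ifconfig lo0 mtu 1500
# start the netperf server daemon, then run a 30-second TCP stream
# test against it over loopback
netserver
netperf -H 127.0.0.1 -t TCP_STREAM -l 30

Running the same test on kernels with and without the virtualization
patches should show the (small) difference mentioned above.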
> 6/ I think that your ng_dummy node is cute..
> can I commit it separately? (after porting it to -current..)

Actually, this code is ugly, as I was stupid enough to invent my own
queue management methods instead of using the existing ones. However,
from the user's perspective the code seems to work without major
problems, so if you want to commit it I would be glad...

> 7/ the vmware image is a great idea.
>
> 8/ can you elaborate on the following:
> * hiding of "foreign" filesystem mounts within chrooted virtual images

Here is a self-explanatory example of hiding "foreign" filesystem
mounts:

tpx30# vimage -c test1 chroot /opt/chrooted_vimage
tpx30# mount
/dev/ad0s1a on / (ufs, local, noatime)
/dev/ad0s1g on /opt (ufs, local, noatime, soft-updates)
/dev/ad0s1f on /usr (ufs, local, noatime, soft-updates)
/dev/ad0s1e on /var (ufs, local, noatime, soft-updates)
mfs:22 on /tmp (mfs, asynchronous, local, noatime)
/dev/ad0s2 on /dos/c (msdos, local)
/dev/ad0s5 on /dos/d (msdos, local)
/dev/ad0s6 on /dos/e (msdos, local)
procfs on /proc (procfs, local)
procfs on /opt/chrooted_vimage/proc (procfs, local)
/usr on /opt/chrooted_vimage/usr (null, local, read-only)
tpx30# vimage test1
Switched to vimage test1
%mount
procfs on /proc (procfs, local)
/usr on /usr (null, local, read-only)
%

> 9/ how does VIPA differ from the JAIL address binding?

Actually, the VIPA feature should be considered completely independent
of the network stack virtualization work. The jail address is usually
bound to an alias address configured on a physical interface; when that
interface goes down, all connections using the address drop dead
instantly. VIPA is a loopback-type internal interface that always
remains up regardless of physical network topology changes. If the
system has multiple physical interfaces, and an alternative path can be
established after a NIC or network route outage, the connections bound
to the VIPA will survive. Anyhow, the idea is borrowed from IBM's
OS/390 TCP/IP implementation, so you can find more on this concept at
http://www-1.ibm.com/servers/eserver/zseries/networking/vipa.html

> 10/ could you use ng_eiface instead of if_ve?

Most probably yes, but my system crashed each time I tried to configure
ng_eiface, so I just took another path and constructed my own stub
Ethernet interface...

> 11/ why was ng_bridge unsuitable for your use?

Both the native and the netgraph bridging code, I believe, were
designed on the presumption that only one "upper" hook is needed to
communicate with the kernel's single network stack. However, this
concept doesn't hold on a kernel with multiple network stacks. I simply
picked the native bridging code first and extended it to support
multiple "upper" hooks; similar extensions have yet to be applied to
ng_bridge, as I didn't have time to implement the same functionality in
two different frameworks.
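To illustrate the single-stack assumption: in a stock kernel, bridging
an interface via netgraph attaches the host's own stack through exactly
one "upper" hook, roughly as in the ng_bridge(4) man page example (the
interface names here are just examples):

# create a bridge node and attach fxp0's lower (wire) side as link0
ngctl mkpeer fxp0: bridge lower link0
# attach fxp0's upper (protocol) side as link1 - this is the one and
# only path from the bridge into the kernel's single network stack
ngctl connect fxp0: fxp0:lower upper link1
# additional physical interfaces hang off further linkN hooks
ngctl connect fxp1: fxp0:lower lower link2

With multiple network stacks, each virtual image would need its own
"upper"-style attachment to the bridge, which is what the extended
native bridging code now provides and what ng_bridge still lacks.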
> 12/ can you elaborate on the following:
> # fix netgraph interface node naming
> # fix the bugs in base networking code (statistics in
> "native" bridging, additional logic for ng_bridge...)

When an interface is moved to a different virtual image, its unit
number gets reassigned, so an interface that was named, say, "vlan3" in
the master virtual image will become "vlan0" when assigned to the
child. The same thing happens when the child virtual image returns the
interface to its parent. The naming of the netgraph nodes associated
with interfaces (ng_iface, ng_ether) should be updated accordingly,
which is currently not done.

I also considered virtualizing the netgraph stack; it would be very
cool if each virtual image could manage its own netgraph tree. However,
when weighing implementation priorities, I concluded that this was
something that could wait until other, more basic things are reworked
properly first. Therefore, in the current implementation the netgraph
subsystem can be managed only from the "master" virtual image.

Hope this was enough elaboration for actually testing the code :)

Have fun,

Marko