From: Marko Zec <zec@tel.fer.hr>
Date: Thu, 24 Oct 2002 02:30:41 +0200
To: Julian Elischer
Cc: "J. 'LoneWolf' Mattsson", freebsd-net@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: RFC: BSD network stack virtualization

Julian Elischer wrote:

> I'm very impressed. I do however have some questions.
> (I have not read the code yet, just the writeup)
>
> 1/ How do you cope with each machine expecting to have its own loopback
> interface? Is it sufficient to make lo1, lo2, lo3 etc. and attach them
> to the appropriate VMs?

The "lo0" interface is created automatically at the time each new
virtual image instance is created. The BSD networking code generally
assumes that "lo0" exists at all times, so it simply wouldn't be
possible to create a network stack instance without its own unique
"lo" interface.

> 2/ How much would be gained (i.e. is it worth it) to combine this with
> jail? Can you combine them? (does it work?) Does it make sense?

Userland separation (hiding) between processes running in different
virtual images is actually accomplished by reusing the jail framework.
The networking work is completely free of jail legacy. Although I
haven't tested it, I'm confident it should be possible to run multiple
jails inside each virtual image. To me that doesn't make much sense,
though, which is why I didn't bother testing it...

> 3/ You implemented this in 4.x which means that we need to reimplement
> it in -current before it has any chance of being 'included'. Do you
> think that would be a big problem?

I must admit that I do not follow development in -current, so it's hard
to tell how much the network stacks have diverged in the areas affected
by the virtualization work. My initial intention was to polish the
virtualization code on the 4.x platform - there are still some major
chunks of coding yet to be done, such as removal of virtual images and
patching of the IPv6 and IPSEC code. Hopefully this will be in sync
with the release of 5.0, so that I can then spend some time porting it
to -current. However, if there is reasonable demand, I'm prepared to
revise that approach...

> 5/ Does inclusion of the virtualisation have any measurable effect on
> throughput for systems that are NOT using virtualisation? In other
> words, does the non-virtualised code path get much extra work? (do you
> have numbers?) (i.e. does it cost much for the OTHER users if we
> incorporated this into FreeBSD?)

The first thing in my pipeline is doing decent performance/throughput
measurements, but these days I just cannot find enough spare time to do
that properly (I still have a daytime job...). Preliminary netperf
tests show around a 1-2% drop in maximum TCP throughput on loopback
with a 1500-byte MTU, so in my opinion this is a negligible penalty.
Of course, applications limited by media speed won't experience any
throughput degradation, except perhaps a barely measurable increase in
CPU time spent in interrupt context.
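If anyone wants to repeat that kind of measurement before I publish
proper numbers, something roughly along these lines should do; this is
only a sketch, and the MTU value, test length and address are arbitrary
examples rather than my exact setup:

# shrink the loopback MTU to match a typical Ethernet segment
ifconfig lo0 mtu 1500
# start the netperf server daemon, then run a 30-second TCP stream
# test against it over loopback
netserver
netperf -H 127.0.0.1 -t TCP_STREAM -l 30

Running the same test on kernels with and without the virtualization
patches should show the (small) difference mentioned above.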
> 6/ I think that your ng_dummy node is cute..
> can I commit it separately? (after porting it to -current..)

Actually, this code is ugly, as I was stupid enough to invent my own
queue management methods instead of using the existing ones. However,
from the user's perspective the code seems to work without major
problems, so if you want to commit it I would be glad...

> 7/ the vmware image is a great idea.
>
> 8/ can you elaborate on the following:
> * hiding of "foreign" filesystem mounts within chrooted virtual images

Here is a self-explanatory example of hiding "foreign" filesystem
mounts:

tpx30# vimage -c test1 chroot /opt/chrooted_vimage
tpx30# mount
/dev/ad0s1a on / (ufs, local, noatime)
/dev/ad0s1g on /opt (ufs, local, noatime, soft-updates)
/dev/ad0s1f on /usr (ufs, local, noatime, soft-updates)
/dev/ad0s1e on /var (ufs, local, noatime, soft-updates)
mfs:22 on /tmp (mfs, asynchronous, local, noatime)
/dev/ad0s2 on /dos/c (msdos, local)
/dev/ad0s5 on /dos/d (msdos, local)
/dev/ad0s6 on /dos/e (msdos, local)
procfs on /proc (procfs, local)
procfs on /opt/chrooted_vimage/proc (procfs, local)
/usr on /opt/chrooted_vimage/usr (null, local, read-only)
tpx30# vimage test1
Switched to vimage test1
%mount
procfs on /proc (procfs, local)
/usr on /usr (null, local, read-only)
%

> 9/ how does VIPA differ from the JAIL address binding?

Actually, the VIPA feature should be considered completely independent
of the network stack virtualization work. The jail address is usually
bound to an alias address configured on a physical interface; when that
interface goes down, all connections using the address drop dead
instantly. VIPA is a loopback-type internal interface that always
remains up regardless of physical network topology changes. If the
system has multiple physical interfaces, and an alternative path can be
established after a NIC or network route outage, the connections bound
to the VIPA will survive. Anyhow, the idea is borrowed from IBM's
OS/390 TCP/IP implementation, so you can find more on this concept at
http://www-1.ibm.com/servers/eserver/zseries/networking/vipa.html

> 10/ could you use ng_eiface instead of if_ve?

Most probably yes, but my system crashed each time I tried to configure
ng_eiface, so I just took another path and constructed my own stub
Ethernet interface...

> 11/ why was ng_bridge unsuitable for your use?

Both the native and the netgraph bridging code, I believe, were
designed on the presumption that only one "upper" hook is needed to
communicate with the kernel's single network stack. However, this
concept doesn't hold on a kernel with multiple network stacks. I simply
picked the native bridging code first and extended it to support
multiple "upper" hooks; similar extensions have yet to be applied to
ng_bridge, as I didn't have time to implement the same functionality in
two different frameworks.
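To illustrate the single-stack assumption: in a stock kernel, bridging
an interface via netgraph attaches the host's own stack through exactly
one "upper" hook, roughly as in the ng_bridge(4) man page example (the
interface names here are just examples):

# create a bridge node and attach fxp0's lower (wire) side as link0
ngctl mkpeer fxp0: bridge lower link0
# attach fxp0's upper (protocol) side as link1 - this is the one and
# only path from the bridge into the kernel's single network stack
ngctl connect fxp0: fxp0:lower upper link1
# additional physical interfaces hang off further linkN hooks
ngctl connect fxp1: fxp0:lower lower link2

With multiple network stacks, each virtual image would need its own
"upper"-style attachment to the bridge, which is what the extended
native bridging code now provides and what ng_bridge still lacks.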
> 12/ can you elaborate on the following:
> # fix netgraph interface node naming
> # fix the bugs in base networking code (statistics in
> "native" bridging, additional logic for ng_bridge...)

When an interface is moved to a different virtual image, its unit
number gets reassigned, so an interface that was named, say, "vlan3" in
the master virtual image will become "vlan0" when assigned to the
child. The same thing happens when the child virtual image returns the
interface to its parent. The naming of the netgraph nodes associated
with interfaces (ng_iface, ng_ether) should be updated accordingly,
which is currently not done.

I also considered virtualizing the netgraph stack; it would be very
cool if each virtual image could manage its own netgraph tree. However,
when weighing implementation priorities, I concluded that this was
something that could wait until other, more basic things are reworked
properly first. Therefore, in the current implementation the netgraph
subsystem can be managed only from the "master" virtual image.

Hope this was enough elaboration for actually testing the code :)

Have fun,

Marko