Subject: Re: Options for zfs inside a VM backed by zfs on the host
From: "Chad J. Milios" <milios@ccsys.com>
Date: Thu, 27 Aug 2015 15:53:42 -0400
To: Paul Vixie
Cc: Matt Churchyard, Vick Khera, allanjude@freebsd.org,
    freebsd-virtualization@freebsd.org, freebsd-fs@freebsd.org

> On Aug 27, 2015, at 10:46 AM, Allan Jude wrote:
>
> On 2015-08-27 02:10, Marcus Reid wrote:
>> On Wed, Aug 26, 2015 at 05:25:52PM -0400, Vick Khera wrote:
>>> I'm running FreeBSD inside a VM that is providing the virtual disks
>>> backed by several ZFS zvols on the host. I want to run ZFS on the VM
>>> itself too, for simplified management and backup purposes.
>>>
>>> The question I have is: on the VM guest, do I really need to run a
>>> raid-z or mirror, or can I just use a single virtual disk (or even a
>>> stripe)? Given that the underlying storage for the virtual disk is a
>>> zvol on a raid-z, there should not really be too much worry about data
>>> corruption, I would think. It would be equivalent to using a hardware
>>> raid for each component of my zfs pool.
>>>
>>> Opinions? Preferably well-reasoned ones. :)
>>
>> This is a frustrating situation, because none of the options that I can
>> think of look particularly appealing. Single-vdev pools would be the
>> best option; your redundancy is already taken care of by the host's
>> pool. The overhead of checksumming, etc. twice is probably not super
>> bad. However, having the ARC eating up lots of memory twice seems
>> pretty bletcherous. You can probably do some tuning to reduce that, but
>> I never liked tuning the ARC much.
>>
>> All the nice features ZFS brings to the table are hard to give up once
>> you get used to having them around, so I understand your quandary.
>>
>> Marcus
>
> You can just:
>
> zfs set primarycache=metadata poolname
>
> And it will only cache metadata in the ARC inside the VM, and avoid
> caching data blocks, which will be cached outside the VM. You could even
> turn the primarycache off entirely.
>
> --
> Allan Jude
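(For what it's worth, primarycache accepts all, none or metadata, so turning
it off entirely inside the guest would just be, with a placeholder pool name:

  zfs set primarycache=none poolname

Whether that is wise depends entirely on your workload.)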
> On Aug 27, 2015, at 1:20 PM, Paul Vixie wrote:
>
> let me ask a related question: i'm using FFS in the guest, zvol on the
> host. should i be telling my guest kernel to not bother with an FFS
> buffer cache at all, or to use a smaller one, or what?

Whether we are talking ffs, ntfs or zpool atop zvol, unfortunately there are
really no simple answers. You must consider your use case and the host and VM
hardware/software configuration, perform meaningful benchmarks and, if you
care about data integrity, run thorough tests of the likely failure modes
(all far more easily said than done). I'm curious to hear more about your use
case(s) and setups so as to offer better insight on what alternatives may
make more or less sense for you. Performance needs? Are you striving for
lower individual latency or higher combined throughput? How critical are
integrity and availability? How do you prefer your backup routine? Do you
handle that in the guest or on the host? Want features like dedup and/or
L2ARC in the mix? (Then everything bears reconsideration; just about triple
your research and testing efforts.)

Sorry, I'm really not trying to scare anyone away from ZFS. It is awesome and
capable of providing amazing solutions, with very reliable and sensible
behavior, if handled with due respect, fear, monitoring and upkeep. :)

There are cases to be made for caching [meta-]data in the child, in the
parent, checksumming in the child/parent/both, and compressing in the child
or the parent. I believe `gstat` along with your custom-made benchmark or
test load will greatly help guide you.

ZFS on ZFS seems to be a hardly studied, seldom reported, never documented,
tedious exercise. Prepare for accelerated greying and balding of your hair.
The parent's volblocksize, the child's ashift, alignment, and interactions
involving raidz stripes (if used) can lead to problems ranging from slightly
decreased performance and storage efficiency to pathological write
amplification within ZFS, with performance and responsiveness crashing and
sinking to the bottom of the ocean. Some datasets can become veritable black
holes to vfs system calls. You may see ZFS reporting elusive errors,
deadlocking or panicking in the child or the parent altogether. With
diligence, though, stable and performant setups can be discovered for many
production situations.

For example, for a zpool (whether used by a VM or not; locally, through
iscsi, ggate[cd], or whatever) atop a zvol which sits on a parent zpool with
no redundancy, I would set primarycache=metadata checksum=off compression=off
for the zvol(s) on the host(s), and for the most part just use the same zpool
settings and sysctl tunings in the VM (or child zpool, whatever role it may
conduct) that I would otherwise use on bare cpu and bare drives (defaults +
compression=lz4 atime=off). However, that simple case is likely not yours.

With ufs/ffs/ntfs/ext4 and most other filesystems atop a zvol I use checksums
on the parent zvol, and compression too if the child doesn't support it (as
ntfs can), but still caching only metadata on the host and letting the child
vm/fs cache real data.
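A rough sketch of those two cases, with every pool, dataset and device name
below just a placeholder, and volblocksize left at its default on purpose
since that's exactly the kind of knob you have to benchmark for your own
workload:

  # host: zvol backing a guest *zpool*; integrity and caching are the
  # child pool's job in this case
  zfs create -V 100G -o primarycache=metadata -o checksum=off -o compression=off tank/vm/guest1-disk0

  # host: zvol backing a "dumb" guest filesystem (ufs/ffs/ntfs/ext4); keep
  # checksums (the default), compress here only if the guest fs won't
  zfs create -V 100G -o primarycache=metadata -o compression=lz4 tank/vm/guest2-disk0

  # inside the first guest: plain single-vdev pool with the usual
  # bare-metal-ish settings
  zpool create data vtbd1
  zfs set compression=lz4 data
  zfs set atime=off data

While your benchmark or test load runs in the guest, something like
`gstat -f zvol` on the host shows how busy the backing zvols really are.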
My use case involves charging customers for their memory use, so admittedly
that is one motivating factor, LOL. Plus, I certainly don't want one rude VM
marching through the host ARC, unfairly evicting and starving the other
polite neighbors.

A VM's swap space becomes another consideration. I treat it like any other
'dumb' filesystem, with compression and checksumming done by the parent, but
recent versions of many operating systems may be paging out only
already-compressed data, so investigate your guest OS. I've found lz4's
claims of an almost-no-penalty early abort to be vastly overstated when
dealing with zvols, small block sizes and high throughput, so if you can be
certain you'll be dealing with only compressed data then turn compression
off. For the virtual memory pagers in most current-day OSes, though, set
compression on the swap's backing zvol to lz4.

Another factor is the ZIL. One VM can hoard your synchronous write
performance. Solutions are beyond the scope of this already-too-long email :)
but I'd be happy to elaborate if queried.

And then there's always netbooting guests from NFS mounts served by the host
and giving the guest no virtual disks at all; don't forget to consider that
option.

Hope this provokes some fruitful ideas for you. Glad to philosophize about
ZFS setups with y'all :)

-chad
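P.S. If a concrete sketch of the swap case helps, the host-side zvol might
look something like this (name and size are placeholders again):

  # checksumming stays on (the parent's job); lz4 unless the guest pages
  # out only already-compressed data, in which case use compression=off
  zfs create -V 8G -o primarycache=metadata -o compression=lz4 tank/vm/guest1-swap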