From owner-freebsd-hackers@freebsd.org Tue Dec 8 15:43:35 2015
From: Warner Losh <wlosh@bsdimp.com>
To: freebsd-hackers@freebsd.org
Date: Tue, 8 Dec 2015 08:43:33 -0700
Subject: Fwd: DELETE support in the VOP_STRATEGY(9)?

[ forgot to cc hackers ]

---------- Forwarded message ----------
From: Warner Losh
Date: Tue, Dec 8, 2015 at 8:41 AM
Subject: Re: DELETE support in the VOP_STRATEGY(9)?
To: Dag-Erling Smørgrav

On Tue, Dec 8, 2015 at 4:06 AM, Dag-Erling Smørgrav wrote:
> Maxim Sobolev writes:
> > Dag-Erling Smørgrav writes:
> > > 1) why did you take this off the list?
> > There was a complaint from the list admin about this being off-topic.
>
> Yes, and Eitan moved the discussion to hackers@. It should have stayed
> there.
>
> > > 2) why did you even bother to cc: me if you were going to completely
> > > ignore everything I said anyway?
> > I did not really ignore it, it's just that I did not have much to
> > reply with at that point. [...]
> > Basically I don't think your concerns wrt DELETE reliability/guarantees
> > have much to do with this particular feature. The reason being that
> > BIO_DELETE essentially tells the storage layer that whichever code
> > "owns" the block in question (e.g. ZFS or UFS) has moved it into the
> > free pool and will NEVER ever want to read its value back again (until
> > it's written into again).
>
> No, it means that the contents of that block are no longer important and
> that the lower layers *may* reclaim it. It does not mean that nobody
> will ever try to read the block, nor does it guarantee that the block
> will actually be reclaimed or zeroed. We cannot rely on the lower
> layers to ensure that reading from a previously deleted block never
> returns data that may have belonged to a different file.
>
> BTW, I've encountered CF cards (including the SanDisk card in my home
> router) that freeze if issued a TRIM command. Furthermore, many CF, MMC
> and SD cards, especially those marketed for use in digital cameras,
> perform wear leveling "automagically" based on their own understanding
> of the filesystem layout, and will therefore work poorly with anything
> other than FAT (Kingston calls it "optimized recording performance" in
> their marketing literature).

While these issues are relevant for BIO_DELETE, they aren't so relevant
for punching a hole in a file in a filesystem. The filesystem is the one
that gets to decide whether and when to issue a BIO_DELETE (just as the
lower layers get to decide what to do with it). A properly written
filesystem will not issue a BIO_DELETE and then assume it will read back
0's. The whole point of the punch-hole operation is to allow the
filesystem to return the blocks to its free store. If that also happens
to have the effect of causing a BIO_DELETE to go down, that's no
different than deleting the file and having a BIO_DELETE go down for the
resulting blocks that are freed.
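That division of responsibility can be sketched in a toy model (all
names and classes here are invented for illustration, not FreeBSD code):
the filesystem tracks punched holes in its own metadata and serves zeros
from there, while the device underneath is free to leave stale bytes
behind after a BIO_DELETE.

```python
# Toy model (invented for illustration; not FreeBSD code) of the rule
# that a filesystem must not assume BIO_DELETE zeroes anything: it tracks
# freed ranges itself and answers reads from its own metadata.

class ToyDevice:
    """Block device whose BIO_DELETE may leave stale data behind (allowed)."""
    def __init__(self, nblocks):
        self.blocks = [b"\xde\xad" for _ in range(nblocks)]  # stale garbage
    def bio_delete(self, blkno):
        pass  # may reclaim later; contents become "don't care", not zero
    def read(self, blkno):
        return self.blocks[blkno]

class ToyFilesystem:
    """Owns block allocation; punching a hole returns blocks to the free
    pool and records the hole so reads of it come back as zeros."""
    def __init__(self, dev, file_blocks):
        self.dev = dev
        self.map = dict(file_blocks)   # file offset -> device block
        self.free_pool = set()
    def punch_hole(self, off):
        blkno = self.map.pop(off)
        self.free_pool.add(blkno)      # the real point: blocks are freed
        self.dev.bio_delete(blkno)     # optional policy; no zeroing implied
    def read(self, off):
        if off not in self.map:        # a hole: defined to read as zeros
            return b"\x00\x00"
        return self.dev.read(self.map[off])

dev = ToyDevice(4)
fs = ToyFilesystem(dev, {0: 0, 1: 1})
fs.punch_hole(1)
print(fs.read(1))   # zeros from the hole map, not from the device
print(dev.read(1))  # stale bytes: BIO_DELETE did not zero the block
```

The filesystem never needs the device to zero anything; the hole map is
what makes reads of the punched range well defined.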
> > > Technically speaking, on a 100% correctly working os/hardware, an
> > > attempt to read a block after it's been successfully BIO_DELETE'd
> > > could produce an exception of some sort without any ill effects.
>
> If that were the case, it would never be safe to do
>
> # dd if=/dev/da0 of=/dev/da1 bs=4096 conv=sparse
>
> which I'm sure you'll agree is not acceptable.

BIO_DELETE doesn't invalidate the LBA range, just its contents. LBAs are
still required to be readable afterwards. This matches how the various
standards dictate what the contents will be after whatever BIO_DELETE
turns into. Maxim is simply wrong on this point, for this and many other
reasons.

> > [...] in this particular case of VOP_ALLOCATE(FALLOC_FL_PUNCH_HOLE),
> > the filesystem in question is responsible for making sure the range
> > that has been punched through reads 0, whether by making a real
> > logical hole in the file and/or by padding it with zeroes as needed.
>
> Is it really?
>
> Here are a few of our options for implementing FALLOC_FL_PUNCH_HOLE:
>
> a) create a filesystem-level hole in the disk image;
> b) perform a), then issue a BIO_DELETE for the blocks that were
>    released;
> c) perform a) or b), then zero the overspill if the requested range is
>    unaligned;
> d) zero the entire range;
> e) perform d) followed by either a) or b);
> f) nothing at all.

I don't think f) is an option, unless it is OK to have random contents
after creating a file, seeking some ways into it, and writing a byte.
When you punch a hole in a file, you should get the same semantics as if
you'd written up to just before the hole originally, then skipped to the
end of the punched range and written the rest of the file. In Unix,
that's well defined to be 0's. It is undefined how those zeros are
backed by the filesystem, or how much storage it takes up. A punch-hole
operation is a stronger statement about the contents after the fact than
a BIO_DELETE operation.
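The "write, seek, write" semantics appealed to above can be demonstrated
from userland on any Unix: a range skipped over with a seek (a hole, if
the filesystem chooses to make one) is defined to read back as zeros,
however it is backed.

```python
import os, tempfile

# Demonstrates the Unix hole semantics described above: bytes skipped
# over with a seek are defined to read back as zeros, regardless of how
# (or whether) the filesystem allocates storage for them.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"head")
    os.lseek(fd, 1 << 20, os.SEEK_SET)   # skip ~1 MB without writing
    os.write(fd, b"tail")
    os.lseek(fd, 4, os.SEEK_SET)         # read from inside the gap
    gap = os.read(fd, 4096)
    assert gap == b"\x00" * 4096         # the skipped range reads as zeros
finally:
    os.close(fd)
    os.unlink(path)
```

Whether the filesystem stores those zeros or records a hole is exactly
the part that is left undefined.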
You are correct, though, that the decision to issue a BIO_DELETE is
between the filesystem and the storage device. This makes a-e possible
implementations, but some are stupider than others (which ones depends
on the situation). Based on the characteristics of both, the filesystem
may return the blocks to its free store without doing anything further
(if it frees them up at all). It could issue a BIO_DELETE on those
blocks, if that is its policy. The device driver for the lower layers
may return an error on the BIO_DELETE request or execute it faithfully.
The filesystem cannot rely on BIO_DELETE being a faster way to write
zeros to the LBAs, though: if it wants zeros, it has to write zeros.
FreeBSD doesn't provide a way for the filesystem to know that the device
implements BIO_DELETE as a guaranteed range of zeros after the operation
completes, even though the device reports that information to FreeBSD
today as part of its IDENTIFY or INQUIRY data.

> Now, consider the case of the guest OS in a VM issuing TRIM commands to
> the emulated storage controller, which the hypervisor translates into a
> FALLOC_FL_PUNCH_HOLE request for the corresponding range in the disk
> image. Discuss the advantages and drawbacks of each option I listed
> above for each of the 36 points in the space defined by the following
> axes:
>
> - The disk image is:
>   - a preallocated file on a filesystem (or an md(4) device backed by a
>     preallocated file)
>   - a dynamically allocated file on a filesystem (or an md(4) device
>     backed by an unallocated file)
>   - a zvol
>   - a device
> - The underlying storage's preferred block size is:
>   - small (e.g. 4 kB sectors on an AF drive)
>   - medium (e.g. 64 kB stripes on a RAID)
>   - large (e.g. 1 MB erase blocks on an SSD)
> - The physical storage is:
>   - volatile
>   - solid-state
>   - electromechanical
>
> If you think the answer is the same in all cases, you are deluded.

That's why these decisions are left to the stack.
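The "if it wants zeros, it has to write zeros" policy can be sketched as
follows (all names here are invented for illustration): BIO_DELETE is
best effort and may be rejected or silently leave stale data, so a
caller that actually needs the range to read as zeros must write the
zeros itself.

```python
# Sketch (invented names, not FreeBSD code) of the policy above:
# BIO_DELETE is advisory and may fail or leave stale data, so a caller
# that needs zeros must write zeros explicitly.

BLKSZ = 512

class ToyDriver:
    """Lower layer that is allowed to reject or ignore BIO_DELETE."""
    def __init__(self, supports_delete):
        self.supports_delete = supports_delete
        self.media = {}                        # blkno -> bytes
    def bio_delete(self, blkno):
        if not self.supports_delete:
            raise OSError("EOPNOTSUPP")        # driver may return an error
        # even on success, contents are indeterminate, not necessarily zero
    def bio_write(self, blkno, data):
        self.media[blkno] = data
    def bio_read(self, blkno):
        return self.media.get(blkno, b"\xcc" * BLKSZ)  # stale by default

def free_blocks(drv, blknos, need_zeros):
    for b in blknos:
        try:
            drv.bio_delete(b)                  # best effort only
        except OSError:
            pass                               # fine: delete is advisory
        if need_zeros:
            drv.bio_write(b, b"\x00" * BLKSZ)  # the only zero guarantee

drv = ToyDriver(supports_delete=False)
free_blocks(drv, [7], need_zeros=True)
assert drv.bio_read(7) == b"\x00" * BLKSZ
```

With need_zeros=False the same call simply returns the blocks without
touching the media, which is all BIO_DELETE ever promised.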
The only semantic required by the punch-hole operation is that the
filesystem return 0's on reads of that range. What the filesystem does
to ensure this is up to the filesystem.

As for md translating a BIO_DELETE into a PUNCH_HOLE, that's an
acceptable thing for it to do (assuming we have a punch-hole API). It is
a stronger guarantee than is required by the BIO_DELETE API. However,
PUNCH_HOLE should be implemented such that it is no slower than writing
zeros, and may be faster. Since md is doing writes of zeros today, this
sounds like a possible win for those filesystems that implement the
punch-hole operation more efficiently than writing a block of zeros. And
it may also allow the storage stack the chance to do an optimization
that isn't present today.

Warner
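A hedged sketch of the md idea: a delete of a range in a file-backed
store can be serviced by punching a hole, falling back to writing zeros
where hole punching is unsupported. (Linux fallocate(2) via ctypes is
used here as a stand-in for whatever punch-hole API the platform offers;
the flag values and the fallback are the illustration, not FreeBSD's
implementation.)

```python
import ctypes, ctypes.util, os, tempfile

# Sketch of the md idea above: service a delete of a file-backed range
# by punching a hole, falling back to writing zeros if the platform or
# filesystem can't.  Uses Linux fallocate(2) as a stand-in punch-hole
# API; either path leaves the range reading as zeros.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

def deallocate(fd, offset, length):
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
        ret = libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             ctypes.c_long(offset), ctypes.c_long(length))
        if ret == 0:
            return "hole"                      # storage actually released
    except (OSError, AttributeError, TypeError):
        pass
    os.pwrite(fd, b"\x00" * length, offset)    # fallback: write the zeros
    return "zeros"

fd, path = tempfile.mkstemp()
try:
    os.pwrite(fd, b"\xff" * 8192, 0)
    how = deallocate(fd, 0, 4096)
    assert os.pread(fd, 4096, 0) == b"\x00" * 4096  # zeros either way
    assert os.pread(fd, 1, 4096) == b"\xff"         # rest untouched
finally:
    os.close(fd)
    os.unlink(path)
```

The punch-hole path is the optimization; the zero-writing path is the
semantic floor that md already provides today.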