From owner-freebsd-arch@FreeBSD.ORG Fri Apr 15 16:22:19 2011 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B4C28106564A for ; Fri, 15 Apr 2011 16:22:19 +0000 (UTC) (envelope-from mdf356@gmail.com) Received: from mail-ww0-f50.google.com (mail-ww0-f50.google.com [74.125.82.50]) by mx1.freebsd.org (Postfix) with ESMTP id 449A08FC18 for ; Fri, 15 Apr 2011 16:22:18 +0000 (UTC) Received: by wwc33 with SMTP id 33so3296107wwc.31 for ; Fri, 15 Apr 2011 09:22:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=tp+ww0FhwtA4amyCGNy4/1LPqGngef6V2mn0F+GFx+4=; b=W3+RzrW3u3e68THWIpcXnV2zEGZhzu8Og77z7/86Fg2ujN/3lo3AWPjYXhH/MXwxlh GKQVwFWDa0B8+5OswtPVei+3IJ8pFfKIJSROi31DDdgHT/GASaNg0uTFrW/Elf2YTN2e hbp7pVLyhwuUl1HQYlzJQVajFGDRtr1xxG5JE= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; b=ldn+c3cr/Qq7WFlHYloLdXcq99uWXGeeeyhgst1G3q/x8Wg4lB2KVng0IhalQITI/6 TW8owauishw3UAn42WP4hOBs4n2JLr44d8V9r8qNtJpDS05AjqdBCKWvK8sCNkdPUfUj pg7Qly81yNoAYWwQ0Qv21QH2Fxd1ePLg9szCk= MIME-Version: 1.0 Received: by 10.216.64.139 with SMTP id c11mr7972800wed.46.1302884538071; Fri, 15 Apr 2011 09:22:18 -0700 (PDT) Sender: mdf356@gmail.com Received: by 10.216.123.15 with HTTP; Fri, 15 Apr 2011 09:22:18 -0700 (PDT) In-Reply-To: <20110415093057.GJ48734@deviant.kiev.zoral.com.ua> References: <20110414213610.GB92382@tops> <20110415093057.GJ48734@deviant.kiev.zoral.com.ua> Date: Fri, 15 Apr 2011 09:22:18 -0700 X-Google-Sender-Auth: EELFNNttp9t7on-371goGabOlh0 Message-ID: From: mdf@FreeBSD.org To: Kostik Belousov Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: FreeBSD Arch Subject: Re: posix_fallocate(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Apr 2011 16:22:19 -0000 2011/4/15 Kostik Belousov : > On Thu, Apr 14, 2011 at 03:41:30PM -0700, mdf@freebsd.org wrote: >> On Thu, Apr 14, 2011 at 2:36 PM, Gleb Kurtsou w= rote: >> > On (14/04/2011 12:35), mdf@FreeBSD.org wrote: >> >> For work we need a functionality in our filesystem that is pretty muc= h >> >> like posix_fallocate(2), so we're using the name and I've added a >> >> default VOP_ALLOCATE definition that does the right, but dumb, thing. >> >> >> >> The most recent mention of this function in FreeBSD was another threa= d >> >> lamenting it's failure to exist: >> >> http://lists.freebsd.org/pipermail/freebsd-ports/2010-February/059268= .html >> >> >> >> The attached files are the core of the kernel implementation of the >> >> syscall and a default VOP for any filesystem not supporting >> >> VOP_ALLOCATE, which allows the syscall to work as expected but in a >> >> non-performant manner. =A0I didn't see this syscall in NetBSD or >> >> OpenBSD, so I plan to add it to the end of our syscall table. >> >> >> >> What I wanted to check with -arch about was: >> >> >> >> 1) is there still a desire for this syscall? >> > It looks not to play well architecturally with modern COW file systems >> > like ZFS and HUMMER. So potentially it can be implemented only for UFS= . >> >> The syscall, or the dumb implementation? =A0I don't see why the syscall >> itself would be a problem; presumably ZFS can figure out whether an >> fallocate() block is worth COWing or not... >> >> >> 2) is this naive implementation useful enough to serve as a default >> >> for all filesystems until someone with more knowledge fills them in? >> > Maillist ate the patch. Only man page attached. >> >> Whoops! >> >> http://people.freebsd.org/~mdf/bsd-fallocate.diff > > New syscall symbols for 9.0 should go in under FBSD_1.2 version, not FBSD= _1.0. Okay, fixed. > You have inconsistent spacing in the kern_posix_fallocate(). Oops; copy/paste error; fixed. > I do not quite understand the locking for vnode you did. > You marked the vop as taking and returning unlocked vnode. But, you > do call VOP_GETATTR in the vop std implementation before locking the vnod= e. > Did you tested with DEBUG_VFS_LOCKS config ? I have mostly tested on the version of FreeBSD we run at work which has some small KPI modifications. I will test and fix up on CURRENT once I figure out prove(1). As for locking: (1) For $WORK FreeBSD's locking of a "File" is problematic since we have both an inode lock and a data lock, and lots of times we don't really need the inode locked exclusively, just the data, which we handle inside the VOP. (2) I don't want to make 1TB allocated in a single operation, under a single lock, so the implementation is responsible for unlocking and taking a breather as needed. (3) I based the VOP_GETATTR on vn_stat which calls VOP_GETATTR without any lock. Except, hmm, it looks like vn_statfile(9) takes the lock. I was trying to avoid a lock/unlock cycle when the file didn't need to be extended, but I can put it back in. > Usual (and proper) practice is to have such vop require locked vnode, in > case of VOP_ALLOCATE, exclusive lock is appropriate. The Giant dance and > vn_start_write() + vn_lock() go into kern_posix_fallocate() then. > Also, you should call bwillwrite() before taking any vfs locks. > > Is locking/unlocking the vnode in loop is done to allow other callers > to perform i/o on the vnode in between ? In particular, to truncate it ? > I think this is not needed, and previous suggestion would take care of it= . See above; it is not acceptable in my mind to lock the vnode for the entire length of the operation, so the locking is managed by the VOP. > Why do you need stdallocate_extend() ? VOP_WRITE does the right thing > with extending the vnode. I was trying to simplify the implementation to a easy read/write loop since it isn't supposed to be performant but just get the right data. I could instead VOP_GETATTR on each loop to check file size and write zeros past the current file size, but that was more logic than a single VOP_SETATTR followed by read/write. > You might find vn_rdwr easier to use then the bare vops. In particular, > it would not omit the mac calls for read/write. I checked for write already in kern_posix_fallocate(). A single check should be sufficient. For other threads, please note I don't know anything about UFS implementations and I can't provide a ufs_allocte() that does rapid allocation of logically zero blocks. My intent is to provide the framework, a default implementation that meets the spec'd behaviour, and a set of testcases suitable to run for any filesystem that wants to verify their implementation. Thanks, matthew