Date:      Fri, 15 Apr 2011 09:28:01 -0700
From:      mdf@FreeBSD.org
To:        Gleb Kurtsou <gleb.kurtsou@gmail.com>
Cc:        FreeBSD Arch <freebsd-arch@freebsd.org>
Subject:   Re: posix_fallocate(2)
Message-ID:  <BANLkTikHbVPbcQ=0zLyFG2Ur7rUs-0Xh2Q@mail.gmail.com>
In-Reply-To: <20110415105409.GA14344@tops>
References:  <BANLkTimYzJ11w9X1OHShEn2wi6gjHx=YjA@mail.gmail.com> <20110414213610.GB92382@tops> <BANLkTi=OWUnB_ue3RT4bzGNvivZwW_ofkA@mail.gmail.com> <20110415105409.GA14344@tops>

On Fri, Apr 15, 2011 at 3:54 AM, Gleb Kurtsou <gleb.kurtsou@gmail.com> wrote:
> On (14/04/2011 15:41), mdf@FreeBSD.org wrote:
>> On Thu, Apr 14, 2011 at 2:36 PM, Gleb Kurtsou <gleb.kurtsou@gmail.com> wrote:
>> > On (14/04/2011 12:35), mdf@FreeBSD.org wrote:
>> >> For work we need functionality in our filesystem that is pretty much
>> >> like posix_fallocate(2), so we're using the name and I've added a
>> >> default VOP_ALLOCATE definition that does the right, but dumb, thing.
>> >>
>> >> The most recent mention of this function in FreeBSD was another thread
>> >> lamenting its failure to exist:
>> >> http://lists.freebsd.org/pipermail/freebsd-ports/2010-February/059268.html
>> >>
>> >> The attached files are the core of the kernel implementation of the
>> >> syscall and a default VOP for any filesystem not supporting
>> >> VOP_ALLOCATE, which allows the syscall to work as expected but in a
>> >> non-performant manner.  I didn't see this syscall in NetBSD or
>> >> OpenBSD, so I plan to add it to the end of our syscall table.
>> >>
>> >> What I wanted to check with -arch about was:
>> >>
>> >> 1) is there still a desire for this syscall?
>> > It does not look like it plays well architecturally with modern COW
>> > file systems like ZFS and HAMMER. So potentially it can be implemented
>> > only for UFS.
>>
>> The syscall, or the dumb implementation?  I don't see why the syscall
>> itself would be a problem; presumably ZFS can figure out whether an
>> fallocate() block is worth COWing or not...
> It is good to have if there is a chance to get a real implementation for
> UFS. Having only a dumb implementation will fool user software into
> thinking that we support it.
>
> As far as I understand, ZFS caches a large chunk of changes and then writes
> all of them at once. I doubt blocks can be preallocated. You preallocate
> a block, it's marked as used in the file system's metadata, and the
> metadata changes are written to disk. That results in an inconsistency,
> because the preallocated block is marked as "used" in the metadata and
> thus can't be overwritten. I might be absolutely wrong; ZFS experts are
> better placed to answer this. Grepping reveals no fallocate support in ZFS.
>
>> >> 2) is this naive implementation useful enough to serve as a default
>> >> for all filesystems until someone with more knowledge fills them in?
>> > The mailing list ate the patch. Only the man page was attached.
>>
>> Whoops!
>>
>> http://people.freebsd.org/~mdf/bsd-fallocate.diff
> What was the performance impact on copying large files?

I don't know and I don't care. :-)  Specifically, one problem is that
there is no file-system implementation of "copy"; copy is implemented
in userspace with read(2) then write(2).
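
To make the point concrete, a userspace copy is roughly the loop below. This
is only a minimal sketch: copy_fd() and the 64 KiB buffer size are names and
values chosen here for illustration, not anything taken from cp(1). The
filesystem only ever sees a stream of ordinary reads and writes.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy everything from in_fd to out_fd using only
 * read(2) and write(2). Returns 0 on success, -1 on error (errno is set).
 */
int copy_fd(int in_fd, int out_fd)
{
    char buf[64 * 1024];    /* buffer size is an arbitrary choice */

    for (;;) {
        ssize_t nread = read(in_fd, buf, sizeof(buf));
        if (nread == 0)
            return 0;       /* end of file */
        if (nread < 0) {
            if (errno == EINTR)
                continue;   /* interrupted, retry the read */
            return -1;
        }

        /* write(2) may be short, so loop until the whole chunk is out. */
        ssize_t off = 0;
        while (off < nread) {
            ssize_t nwritten = write(out_fd, buf + off, (size_t)(nread - off));
            if (nwritten < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += nwritten;
        }
    }
}
```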

If the caller says posix_fallocate() then they want blocks.  If
copying a large file is slower after that, well, they asked for it.
This implementation only meets the spec; it's not meant to be optimal.
An optimal VOP_WRITE() implementation may check that e.g. the next
block on write is all zero, and so will make a new logical-zero block
in the same manner as VOP_ALLOCATE.  This is up to each filesystem.
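
The all-zero check mentioned above could look something like this. It is a
sketch only: block_is_all_zero() is a hypothetical helper name, and where (or
whether) a given filesystem's write path would call such a check is left
open here.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: return true if every byte of the block is zero.
 * A write path could use a check like this to decide that a block needs
 * no real data written, only a logical-zero allocation.
 */
bool block_is_all_zero(const unsigned char *block, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (block[i] != 0)
            return false;   /* found real data */
    }
    return true;            /* also true for an empty block */
}
```

The scan is linear in the block size, so whether it pays for itself depends
on how often callers really write zero-filled blocks.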

> I had sparse file support in PEFS implemented in a similar way.

posix_fallocate() is specifically to *not* have a sparse file.
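
From the caller's side, that looks roughly like the sketch below.
reserve_range() and allocated_bytes() are hypothetical helper names; the
only real interfaces used are posix_fallocate() and fstat(2). The 512-byte
unit for st_blocks is an assumption that holds on FreeBSD and Linux.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: ask for real storage behind [offset, offset + len).
 * Returns 0 on success or an error number; posix_fallocate() reports
 * errors through its return value and does not set errno.
 */
int reserve_range(int fd, off_t offset, off_t len)
{
    return posix_fallocate(fd, offset, len);
}

/*
 * Hypothetical helper: bytes of storage currently backing the file, or -1
 * on error. Assumes st_blocks counts 512-byte units.
 */
long long allocated_bytes(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return (long long)st.st_blocks * 512;
}
```

Comparing allocated_bytes() with st_size before and after the call is one
way to see whether a file is sparse, though the numbers a particular
filesystem reports will vary.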

> Performance was terrible: the vm
> and buf caches were saturated first by writing huge chunks of zeros and
> then by mmap'ing and writing actual data. sched_yield() and/or vnode
> lock/unlock didn't improve interactive performance either.
>
> Why wouldn't you just call VOP_SETATTR(newsize) in the dumb implementation?
> File systems expect such behavior; cp has been using mmap for a while
> already.

VOP_SETATTR(newsize) could truncate, if e.g. the file is already large
and sparse and the fallocate(2) was to provide guaranteed storage only
to the first 1MB.
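
The difference is easiest to see in the size arithmetic. The two functions
below are hypothetical and only model the resulting file size; they are not
taken from the patch. POSIX says posix_fallocate() grows the file when
offset + len is past the current end and otherwise leaves the size alone.

```c
#include <stdint.h>

/*
 * Model of "just set the size to offset + len": the current size is
 * ignored, so a larger file would be truncated.
 */
int64_t size_after_setsize(int64_t cur_size, int64_t offset, int64_t len)
{
    (void)cur_size;
    return offset + len;
}

/*
 * Model of the posix_fallocate() rule: extend the file if the range ends
 * past the current size, otherwise leave the size unchanged.
 */
int64_t size_after_allocate(int64_t cur_size, int64_t offset, int64_t len)
{
    int64_t end = offset + len;

    return end > cur_size ? end : cur_size;
}
```

For an 8 MiB sparse file and a request covering only the first 1 MiB, the
first model yields a 1 MiB file and the second keeps all 8 MiB.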

Thanks,
matthew


