From owner-freebsd-arch@FreeBSD.ORG Mon Apr 11 20:52:13 2011 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B4E92106566C for ; Mon, 11 Apr 2011 20:52:13 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [65.120.238.197]) by mx1.freebsd.org (Postfix) with ESMTP id 89A818FC15 for ; Mon, 11 Apr 2011 20:52:13 +0000 (UTC) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.4/8.14.1) with ESMTP id p3BKfpB8070252 for ; Mon, 11 Apr 2011 13:41:51 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.4/8.13.4/Submit) id p3BKfp8n070251; Mon, 11 Apr 2011 13:41:51 -0700 (PDT) Date: Mon, 11 Apr 2011 13:41:51 -0700 (PDT) From: Matthew Dillon Message-Id: <201104112041.p3BKfp8n070251@apollo.backplane.com> To: arch@freebsd.org References: <132388F1-44D9-45C9-AE05-1799A7A2DCD9@neville-neil.com> <20110319160400.000043f5@unknown> <72B8E80C-E4C7-4763-A7B5-7A4441188C00@neville-neil.com> <20110320171122.00004613@unknown> Cc: Subject: Re: Updating our TCP and socket sysctl values... X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 11 Apr 2011 20:52:13 -0000 This is a little late but I should note that FreeBSD had the inflight limiting code (which I wrote long ago) which could be turned on with a simple sysctl. It was replaced in September 2010 with the 'pluggable congestion control algorithm' which I am unfamiliar with, but which I assume has some equivalent replacement for the functionality. 
Also I believe FreeBSD turns on autosndbuf and autorcvbuf by default now, which means the buffers ARE being made larger than the defaults (assuming that tcp window shift is not disabled, otherwise you are limited to 65535 bytes no matter how big your buffers are).

In any case, the inflight/congestion-control algorithms essentially remove the buffer bloat issue for tcp transmit buffers. It doesn't matter HOW big your transmit buffer and the receive side's receive buffer are.

People running servers need to turn on the inflight or equivalent limiter for several reasons:

* BW x DELAY products are all over the map these days, creating surges if packet backlog is not controlled. Reasons for this vary but a good chunk can be blamed on ISPs and network providers who implement burst/backoff schemes (COMCAST is a good example, where your downlink bandwidth is cut in half after around 10 seconds at full bore).

* Default tcp buffer sizes are too small.

* Auto-sized tcp buffers are often too large (from a buffer bloat perspective).

* Edge routers and other routers in the infrastructure have huge amounts of buffer memory these days, so drops don't really start to happen until AFTER the network has become almost unusable.

* Turning it on significantly reduces the packet backlog at choke points in the network (usually the edge router), and significantly improves the ability of fair-share and QOS algorithms on the border router to manage traffic. That is, nearly all border routers are going to have some form of QOS, but there is a world of difference between having to implement those algorithms for 500 simultaneous connections with 3 packets of backlog per connection and having to do it for 50 packets of backlog per connection.

* You don't want to run RED on a border router; RED is designed for the middle of large switching networks. At the edges there are lots of other choices that do not require dropping packets randomly. Plus TCP SACK (which most sites now implement) tends to defeat RED these days, so when RED is used at all, the word 'random' becomes equivalent to 'frustration'.

* Even if your SERVER has tons of bandwidth, probably a good number of the poor CLIENTS on the other end of the connection do not. If you don't control the backlog, guess where all that backlog ends up? That's right, it ends up on the client-side border routers... for example, it ends up at the DSLAM if the client is a DSL user. These edge routers are the very last place where you ever want packet backlog to accumulate. They don't handle it well.

So, basically that means some sort of transmit-side congestion control needs to be turned on... and frankly should be turned on by default. It just isn't optional any more.

I've been screwing around with this stuff for a long time. I have colocated boxes with tons of bw, and servers running out of the house that don't (though ganging the uplink for U-Verse and COMCAST together is actually quite nice). I've played with the issue from both sides of the coin. Clients are a lot happier when servers don't build up 50+ packets of backlog from a single connection (let alone several), and routers can manage QOS better when the nearby servers don't blast THEM full of packets.

As an example of this, pulling video over TCP on my downlink from several concurrent sources over the native COMCAST or U-Verse links tended to create problems when those video streams were from well-endowed servers. Eventually the only real solution was to run a VPN to a fast colocated box and run FAIRQ/PF on both sides to control the bandwidth in both directions, simply because there are too many video servers out there which do essentially no bandwidth management at all. Examining the queue showed multiple (video) servers out on the internet would happily build up 50+ packets per connection.
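To put the backlog figures in this message in perspective, here is a rough back-of-the-envelope sketch. The 1500-byte packet size, the 10 Mbit/s link and the 100 ms round-trip time are assumptions picked for illustration, and the function names are made up; only the 65535-byte unscaled window and the 3 versus 50 packets per connection come from the text.

```python
# Back-of-the-envelope arithmetic for the figures quoted in this message.
# The packet size, link speed and RTT below are assumed example values.

PACKET_BYTES = 1500          # assumed full-size packet
MAX_UNSCALED_WINDOW = 65535  # largest TCP window without window scaling


def bdp_bytes(bandwidth_bits_per_s, rtt_s):
    """Bandwidth x delay product: bytes that must be in flight to fill the path."""
    return bandwidth_bits_per_s * rtt_s / 8


def unscaled_throughput_bits_per_s(rtt_s):
    """Best-case throughput when the window is capped at 65535 bytes."""
    return MAX_UNSCALED_WINDOW * 8 / rtt_s


def backlog_bytes(connections, packets_per_connection):
    """Total queue occupancy at a choke point for a given per-connection backlog."""
    return connections * packets_per_connection * PACKET_BYTES


if __name__ == "__main__":
    print(bdp_bytes(10e6, 0.1))                  # 10 Mbit/s path, 100 ms RTT
    print(unscaled_throughput_bits_per_s(0.1))   # cap imposed by a 64K window
    print(backlog_bytes(500, 3), backlog_bytes(500, 50))
```

Under those assumptions, 500 connections holding 3 packets each keep about 2.25 MB queued at the router, while 50 packets each keep about 37.5 MB queued, which is the difference the QOS point above is getting at.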
Telco and cable providers that most clients are connected to just can't handle it without something blowing up. -Matt From owner-freebsd-arch@FreeBSD.ORG Thu Apr 14 19:35:37 2011 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E6900106564A for ; Thu, 14 Apr 2011 19:35:37 +0000 (UTC) (envelope-from mdf356@gmail.com) Received: from mail-wy0-f182.google.com (mail-wy0-f182.google.com [74.125.82.182]) by mx1.freebsd.org (Postfix) with ESMTP id 7282E8FC17 for ; Thu, 14 Apr 2011 19:35:35 +0000 (UTC) Received: by wyf23 with SMTP id 23so2032816wyf.13 for ; Thu, 14 Apr 2011 12:35:34 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:sender:date:x-google-sender-auth :message-id:subject:from:to:content-type; bh=EACQ69xesGNC7BGQ7XZD5kq6QytjK8QQgQcwrjETwrU=; b=hWpcXdCam7jY+zYMIDtbZwKej3PtRDpTFCdvnocRcklLeILy4WRUFG3YSU1w8NTMCH Q2HEBZvHN8rYhqH/gxfI1M33uZEzHCOx5xHEQYtUGH0HCHC+jPhbPh2WLZirsSSqJXYl y6IJ9e9uYAAdbNpMCZQue1HfvzPLDb7c5om3s= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:date:x-google-sender-auth:message-id:subject :from:to:content-type; b=M7hW3l93af0ytBiJK55zzF2oljOjuxlayeWipEXr0FBqTLNqWb4xqWl7UDPRzpRvM4 QWmEGlQCkM5IZuVobZ71a93jUxxuTCJ/cHAjUG4u6QBXeZlDX5ldcqkbkY5lpA93h+UX nMUUYNSbSaswVZdJG7eM9D3bSGMin3YIpP+FA= MIME-Version: 1.0 Received: by 10.216.64.139 with SMTP id c11mr6955167wed.46.1302809734100; Thu, 14 Apr 2011 12:35:34 -0700 (PDT) Sender: mdf356@gmail.com Received: by 10.216.123.15 with HTTP; Thu, 14 Apr 2011 12:35:34 -0700 (PDT) Date: Thu, 14 Apr 2011 12:35:34 -0700 X-Google-Sender-Auth: 91_mRWzrwfsAv34aHi32iTz3Ybc Message-ID: From: mdf@FreeBSD.org To: FreeBSD Arch Content-Type: multipart/mixed; boundary=000e0cd56c5a6936b704a0e607a2 X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Subject: posix_fallocate(2) X-BeenThere: 
freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Apr 2011 19:35:38 -0000 --000e0cd56c5a6936b704a0e607a2 Content-Type: text/plain; charset=ISO-8859-1

For work we need functionality in our filesystem that is pretty much like posix_fallocate(2), so we're using the name and I've added a default VOP_ALLOCATE definition that does the right, but dumb, thing.

The most recent mention of this function in FreeBSD was another thread lamenting its failure to exist:
http://lists.freebsd.org/pipermail/freebsd-ports/2010-February/059268.html

The attached files are the core of the kernel implementation of the syscall and a default VOP for any filesystem not supporting VOP_ALLOCATE, which allows the syscall to work as expected but in a non-performant manner. I didn't see this syscall in NetBSD or OpenBSD, so I plan to add it to the end of our syscall table.

What I wanted to check with -arch about was:

1) is there still a desire for this syscall?
2) is this naive implementation useful enough to serve as a default for all filesystems until someone with more knowledge fills them in?
3) are there any obvious bugs or missing elements?
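As a small illustration of what the proposed interface promises, the sketch below calls it from userland through Python's os.posix_fallocate() wrapper, which exists only on Unix platforms that provide the function. The helper names preallocate and demo, the temporary file and the 4096-byte length are made up for this example and are not part of the patch.

```python
import os
import tempfile


def preallocate(path, offset, length):
    """Reserve storage for [offset, offset + length) and return the file size."""
    fd = os.open(path, os.O_RDWR)
    try:
        # Raises OSError (for example ENOSPC) if the space cannot be reserved.
        os.posix_fallocate(fd, offset, length)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


def demo():
    """Preallocate 4096 bytes over a 5-byte file; return (size, first 5 bytes)."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello")
        path = tmp.name
    try:
        size = preallocate(path, 0, 4096)
        with open(path, "rb") as f:
            head = f.read(5)
        return size, head
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

The demo exercises the two properties the attached man page spells out: the file size grows to offset + len when the range extends past end of file, and existing data inside the range is left unmodified.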
Thanks, matthew --000e0cd56c5a6936b704a0e607a2 Content-Type: application/octet-stream; name="posix_fallocate.2" Content-Disposition: attachment; filename="posix_fallocate.2" Content-Transfer-Encoding: base64 X-Attachment-Id: f_gmi37vt21 LlwiIENvcHlyaWdodCAoYykgMTk4MCwgMTk5MSwgMTk5MwouXCIJVGhlIFJlZ2VudHMgb2YgdGhl IFVuaXZlcnNpdHkgb2YgQ2FsaWZvcm5pYS4gIEFsbCByaWdodHMgcmVzZXJ2ZWQuCi5cIgouXCIg UmVkaXN0cmlidXRpb24gYW5kIHVzZSBpbiBzb3VyY2UgYW5kIGJpbmFyeSBmb3Jtcywgd2l0aCBv ciB3aXRob3V0Ci5cIiBtb2RpZmljYXRpb24sIGFyZSBwZXJtaXR0ZWQgcHJvdmlkZWQgdGhhdCB0 aGUgZm9sbG93aW5nIGNvbmRpdGlvbnMKLlwiIGFyZSBtZXQ6Ci5cIiAxLiBSZWRpc3RyaWJ1dGlv bnMgb2Ygc291cmNlIGNvZGUgbXVzdCByZXRhaW4gdGhlIGFib3ZlIGNvcHlyaWdodAouXCIgICAg bm90aWNlLCB0aGlzIGxpc3Qgb2YgY29uZGl0aW9ucyBhbmQgdGhlIGZvbGxvd2luZyBkaXNjbGFp bWVyLgouXCIgMi4gUmVkaXN0cmlidXRpb25zIGluIGJpbmFyeSBmb3JtIG11c3QgcmVwcm9kdWNl IHRoZSBhYm92ZSBjb3B5cmlnaHQKLlwiICAgIG5vdGljZSwgdGhpcyBsaXN0IG9mIGNvbmRpdGlv bnMgYW5kIHRoZSBmb2xsb3dpbmcgZGlzY2xhaW1lciBpbiB0aGUKLlwiICAgIGRvY3VtZW50YXRp b24gYW5kL29yIG90aGVyIG1hdGVyaWFscyBwcm92aWRlZCB3aXRoIHRoZSBkaXN0cmlidXRpb24u Ci5cIiA0LiBOZWl0aGVyIHRoZSBuYW1lIG9mIHRoZSBVbml2ZXJzaXR5IG5vciB0aGUgbmFtZXMg b2YgaXRzIGNvbnRyaWJ1dG9ycwouXCIgICAgbWF5IGJlIHVzZWQgdG8gZW5kb3JzZSBvciBwcm9t b3RlIHByb2R1Y3RzIGRlcml2ZWQgZnJvbSB0aGlzIHNvZnR3YXJlCi5cIiAgICB3aXRob3V0IHNw ZWNpZmljIHByaW9yIHdyaXR0ZW4gcGVybWlzc2lvbi4KLlwiCi5cIiBUSElTIFNPRlRXQVJFIElT IFBST1ZJREVEIEJZIFRIRSBSRUdFTlRTIEFORCBDT05UUklCVVRPUlMgYGBBUyBJUycnIEFORAou XCIgQU5ZIEVYUFJFU1MgT1IgSU1QTElFRCBXQVJSQU5USUVTLCBJTkNMVURJTkcsIEJVVCBOT1Qg TElNSVRFRCBUTywgVEhFCi5cIiBJTVBMSUVEIFdBUlJBTlRJRVMgT0YgTUVSQ0hBTlRBQklMSVRZ IEFORCBGSVRORVNTIEZPUiBBIFBBUlRJQ1VMQVIgUFVSUE9TRQouXCIgQVJFIERJU0NMQUlNRUQu ICBJTiBOTyBFVkVOVCBTSEFMTCBUSEUgUkVHRU5UUyBPUiBDT05UUklCVVRPUlMgQkUgTElBQkxF Ci5cIiBGT1IgQU5ZIERJUkVDVCwgSU5ESVJFQ1QsIElOQ0lERU5UQUwsIFNQRUNJQUwsIEVYRU1Q TEFSWSwgT1IgQ09OU0VRVUVOVElBTAouXCIgREFNQUdFUyAoSU5DTFVESU5HLCBCVVQgTk9UIExJ 
TUlURUQgVE8sIFBST0NVUkVNRU5UIE9GIFNVQlNUSVRVVEUgR09PRFMKLlwiIE9SIFNFUlZJQ0VT OyBMT1NTIE9GIFVTRSwgREFUQSwgT1IgUFJPRklUUzsgT1IgQlVTSU5FU1MgSU5URVJSVVBUSU9O KQouXCIgSE9XRVZFUiBDQVVTRUQgQU5EIE9OIEFOWSBUSEVPUlkgT0YgTElBQklMSVRZLCBXSEVU SEVSIElOIENPTlRSQUNULCBTVFJJQ1QKLlwiIExJQUJJTElUWSwgT1IgVE9SVCAoSU5DTFVESU5H IE5FR0xJR0VOQ0UgT1IgT1RIRVJXSVNFKSBBUklTSU5HIElOIEFOWSBXQVkKLlwiIE9VVCBPRiBU SEUgVVNFIE9GIFRISVMgU09GVFdBUkUsIEVWRU4gSUYgQURWSVNFRCBPRiBUSEUgUE9TU0lCSUxJ VFkgT0YKLlwiIFNVQ0ggREFNQUdFLgouXCIKLlwiICAgICBAKCMpb3Blbi4yCTguMiAoQmVya2Vs ZXkpIDExLzE2LzkzCi5cIiAkRnJlZUJTRCQKLlwiCi5EZCBBcHJpbCAxMywgMjAxMQouRHQgUE9T SVhfRkFMTE9DQVRFIDIKLk9zCi5TaCBOQU1FCi5ObSBwb3NpeF9mYWxsb2NhdGUKLk5kIHByZS1h bGxvY2F0ZSBzdG9yYWdlIGZvciBhIHJhbmdlIGluIGEgZmlsZQouU2ggTElCUkFSWQouTGIgbGli YwouU2ggU1lOT1BTSVMKLkluIGZjbnRsLmgKLkZ0IGludAouRm4gcG9zaXhfZmFsbG9jYXRlICJp bnQgZmQiICJvZmZfdCBvZmZzZXQiICJvZmZfdCBsZW4iCi5TaCBERVNDUklQVElPTgpSZXF1aXJl ZCBzdG9yYWdlIGZvciB0aGUgcmFuZ2UKLkZhIG9mZnNldAp0bwouRmEgb2Zmc2V0ICsKLkZhIGxl bgppbiB0aGUgZmlsZSByZWZlcmVuY2VkIGJ5Ci5GYSBmZAppcyBndWFyYXRlZWQgdG8gYmUgYWxs b2NhdGVkIHVwb24gc3VjY2Vzc2Z1bCByZXR1cm4uClRoYXQgaXMsIGlmCi5GbiBwb3NpeF9mYWxs b2NhdGUKcmV0dXJucyBzdWNjZXNzZnVsbHksIHN1YnNlcXVlbnQgd3JpdGVzIHRvIHRoZSBzcGVj aWZpZWQgZmlsZSBkYXRhCndpbGwgbm90IGZhaWwgZHVlIHRvIGxhY2sgb2YgZnJlZSBzcGFjZSBv biB0aGUgZmlsZSBzeXN0ZW0gc3RvcmFnZQptZWRpYS4KQW55IGV4aXN0aW5nIGZpbGUgZGF0YSBp biB0aGUgc3BlY2lmaWVkIHJhbmdlIGlzIHVubW9kaWZpZWQuCklmCi5GYSBvZmZzZXQgKwouRmEg bGVuCmlzIGJleW9uZCB0aGUgY3VycmVudCBmaWxlIHNpemUsIHRoZW4KLkZuIHBvc2l4X2ZhbGxv Y2F0ZQp3aWxsIGFkanVzdCB0aGUgZmlsZSBzaXplIHRvCi5GYSBvZmZzZXQgKwouRmEgbGVuIC4K T3RoZXJ3aXNlLCB0aGUgZmlsZSBzaXplIHdpbGwgbm90IGJlIGNoYW5nZWQuCi5QcApTcGFjZSBh bGxvY2F0ZWQgYnkKLkZuIHBvc2l4X2ZhbGxvY2F0ZQp3aWxsIGJlIGZyZWVkIGJ5IGEgc3VjY2Vz c2Z1bCBjYWxsIHRvCi5YciBjcmVhdCAyCm9yCi5YciBvcGVuIDIKdGhhdCB0cnVuY2F0ZXMgdGhl IHNpemUgb2YgdGhlIGZpbGUuClNwYWNlIGFsbG9jYXRlZCB2aWEKLkZuIHBvc2l4X2ZhbGxvY2F0 
ZQptYXkgYmUgZnJlZWQgYnkgYSBzdWNjZXNzZnVsIGNhbGwgdG8KLlhyIGZ0cnVuY2F0ZSAyCnRo YXQgcmVkdWNlcyB0aGUgZmlsZSBzaXplIHRvIGEgc2l6ZSBzbWFsbGVyIHRoYW4KLkZhIG9mZnNl dCArCi5GYSBsZW4gLgouUHAKLlNoIFJFVFVSTiBWQUxVRVMKSWYgc3VjY2Vzc2Z1bCwKLkZuIHBv c2l4X2ZhbGxvY2F0ZQpyZXR1cm5zIHplcm8uCkl0IHJldHVybnMgLTEgb24gZmFpbHVyZSwgYW5k IHNldHMKLlZhIGVycm5vCnRvIGluZGljYXRlIHRoZSBlcnJvci4KLlNoIEVSUk9SUwpQb3NzaWJs ZSBmYWlsdXJlIGNvbmRpdGlvbnM6Ci5CbCAtdGFnIC13aWR0aCBFcgouSXQgQnEgRXIgRUJBREYK VGhlCi5GYSBmZAphcmd1bWVudCBpcyBub3QgYSB2YWxpZCBmaWxlIGRlc2NyaXB0b3IuCi5JdCBC cSBFciBFQkFERgpUaGUKLkZhIGZkCmFyZ3VtZW50IHJlZmVyZW5jZXMgYSBmaWxlIHRoYXQgd2Fz IG9wZW5lZCB3aXRob3V0IHdyaXRlIHBlcm1pc3Npb24uCi5JdCBCcSBFciBFRkJJRwpUaGUgdmFs dWUgb2YKLkZhIG9mZnNldCArCi5GYSBsZW4KaXMgZ3JlYXRlciB0aGFuIHRoZSBtYXhpbXVtIGZp bGUgc2l6ZS4KLkl0IEJxIEVyIEVJTlRSCkEgc2lnbmFsIHdhcyBjYXVnaHQgZHVyaW5nIGV4ZWN1 dGlvbi4KLkl0IEJxIEVyIEVJTlZBTApUaGUKLkZhIGxlbgphcmd1bWVudCB3YXMgemVybyBvciB0 aGUKLkZhIG9mZnNldAphcmd1bWVudCB3YXMgbGVzcyB0aGFuIHplcm8uCi5JdCBCcSBFciBFSU8K QW4gSS9PIGVycm9yIG9jY3VycmVkIHdoaWxlIHJlYWRpbmcgZnJvbSBvciB3cml0aW5nIHRvIGEg ZmlsZSBzeXN0ZW0uCi5JdCBCcSBFciBFTk9ERVYKVGhlCi5GYSBmZAphcmd1bWVudCBkb2VzIG5v dCByZWZlciB0byBhIHJlZ3VsYXIgZmlsZS4KLkl0IEJxIEVyIEVOT1NQQwpUaGVyZSBpcyBpbnN1 ZmZpY2llbnQgZnJlZSBzcGFjZSByZW1haW5pbmcgb24gdGhlIGZpbGUgc3lzdGVtIHN0b3JhZ2UK bWVkaWEuCi5JdCBCcSBFciBFU1BJUEUKVGhlCi5GYSBmZAphcmd1bWVudCBpcyBhc3NvY2lhdGVk IHdpdGggYSBwaXBlIG9yIEZJRk8uCi5FbAouU2ggU0VFIEFMU08KLlhyIGNyZWF0IDIgLAouWHIg ZnRydW5jYXRlIDIgLAouWHIgb3BlbiAyICwKLlhyIHVubGluayAyCi5TaCBTVEFOREFSRFMKVGhl Ci5GbiBwb3NpeF9mYWxsb2NhdGUKc3lzdGVtIGNhbGwgY29uZm9ybXMgdG8KLlN0IC1wMTAwMy4x LTIwMDQgLgouU2ggSElTVE9SWQpUaGUKLkZuIHBvc2l4X2ZhbGxvY2F0ZQpmdW5jdGlvbiBhcHBl YXJlZCBpbgouRnggOS4wIC4KLlNoIEFVVEhPUlMKLkZuIHBvc2l4X2ZhbGxvY2F0ZQphbmQgdGhp cyBtYW51YWwgcGFnZSB3ZXJlIGluaXRpYWxseSB3cml0dGVuIGJ5Ci5BbiBNYXR0aGV3IEZsZW1p bmcgQXEgbWRmQEZyZWVCU0Qub3JnIC4K --000e0cd56c5a6936b704a0e607a2-- From owner-freebsd-arch@FreeBSD.ORG Thu Apr 14 19:37:15 2011 Return-Path: 
Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0E318106566C for ; Thu, 14 Apr 2011 19:37:15 +0000 (UTC) (envelope-from mdf356@gmail.com) Received: from mail-ww0-f50.google.com (mail-ww0-f50.google.com [74.125.82.50]) by mx1.freebsd.org (Postfix) with ESMTP id 912788FC1B for ; Thu, 14 Apr 2011 19:37:14 +0000 (UTC) Received: by wwc33 with SMTP id 33so2315959wwc.31 for ; Thu, 14 Apr 2011 12:37:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:content-type :content-transfer-encoding; bh=YZlyIPViUhNZ3sGvSu4rRtHpXayDjM6Og/bR25ywDC0=; b=BWZuN4e335/tPHVkQh6WWlb1bFYm4ICS24B0PYepYDWs1+iaRQGiypQMGlTq60nziz 3q7A71RqBToaTnkRCBzdch2he3bKFt+MRIGoTHzHK/YEmJ7zysYkaOQoDMxlVIeHWduU JwXf9GQ54yJa33VQ2KBELFyxWsYfuOw007EB8= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:content-type :content-transfer-encoding; b=Pe0kY2HLk6TvjwcxjPcfLnCvCDFggwVotWOA3gLSbSItQa4o/fKAEylDf+poGZP0O+ tq6zrS389b0pUriiwSvsPLqGImtJP1Uiw/mKzwwtvoA2mpqG6QtR2MknCHZpqYiYEPeb A9ZOiCFqgTNjCNAIM1puqrDMURbf4jZaqzcos= MIME-Version: 1.0 Received: by 10.216.254.90 with SMTP id g68mr1208705wes.16.1302809833485; Thu, 14 Apr 2011 12:37:13 -0700 (PDT) Sender: mdf356@gmail.com Received: by 10.216.123.15 with HTTP; Thu, 14 Apr 2011 12:37:13 -0700 (PDT) In-Reply-To: References: Date: Thu, 14 Apr 2011 12:37:13 -0700 X-Google-Sender-Auth: j5McqwBplmEW8dIISQneYfUhPV4 Message-ID: From: mdf@FreeBSD.org To: FreeBSD Arch Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Subject: Re: posix_fallocate(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD 
architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Apr 2011 19:37:15 -0000

On Thu, Apr 14, 2011 at 12:35 PM, wrote:
> For work we need a functionality in our filesystem that is pretty much
> like posix_fallocate(2), so we're using the name and I've added a
> default VOP_ALLOCATE definition that does the right, but dumb, thing.
>
> The most recent mention of this function in FreeBSD was another thread
> lamenting it's failure to exist:
> http://lists.freebsd.org/pipermail/freebsd-ports/2010-February/059268.html
>
> The attached files are the core of the kernel implementation of the
> syscall and a default VOP for any filesystem not supporting
> VOP_ALLOCATE, which allows the syscall to work as expected but in a
> non-performant manner. I didn't see this syscall in NetBSD or
> OpenBSD, so I plan to add it to the end of our syscall table.

I should note that I have a bunch of unit tests as well, but they're currently using $WORK's test harness, so I plan to figure out how to re-write them into the existing prove(1) harness.

Thanks,
matthew

>
> What I wanted to check with -arch about was:
>
> 1) is there still a desire for this syscall?
> 2) is this naive implementation useful enough to serve as a default
> for all filesystems until someone with more knowledge fills them in?
> 3) are there any obvious bugs or missing elements?
> > Thanks, > matthew > From owner-freebsd-arch@FreeBSD.ORG Thu Apr 14 21:34:37 2011 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E22A0106566B; Thu, 14 Apr 2011 21:34:37 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id B86F28FC08; Thu, 14 Apr 2011 21:34:37 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id 5283646B09; Thu, 14 Apr 2011 17:34:37 -0400 (EDT) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id D4A968A01B; Thu, 14 Apr 2011 17:34:36 -0400 (EDT) From: John Baldwin To: freebsd-arch@freebsd.org Date: Thu, 14 Apr 2011 17:08:32 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110325; KDE/4.5.5; amd64; ; ) References: In-Reply-To: MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Message-Id: <201104141708.32568.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.6 (bigwig.baldwin.cx); Thu, 14 Apr 2011 17:34:36 -0400 (EDT) Cc: mdf@freebsd.org Subject: Re: posix_fallocate(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Apr 2011 21:34:38 -0000 On Thursday, April 14, 2011 3:35:34 pm mdf@freebsd.org wrote: > For work we need a functionality in our filesystem that is pretty much > like posix_fallocate(2), so we're using the name and I've added a > default VOP_ALLOCATE definition that does the right, but dumb, thing. 
> > The most recent mention of this function in FreeBSD was another thread > lamenting it's failure to exist: > http://lists.freebsd.org/pipermail/freebsd-ports/2010-February/059268.html > > The attached files are the core of the kernel implementation of the > syscall and a default VOP for any filesystem not supporting > VOP_ALLOCATE, which allows the syscall to work as expected but in a > non-performant manner. I didn't see this syscall in NetBSD or > OpenBSD, so I plan to add it to the end of our syscall table. > > What I wanted to check with -arch about was: > > 1) is there still a desire for this syscall? > 2) is this naive implementation useful enough to serve as a default > for all filesystems until someone with more knowledge fills them in? > 3) are there any obvious bugs or missing elements? Hmm, this would be good to have. Unfortunately the list manager software ate everything except the manpage. Can you post the patches at a URL? -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Thu Apr 14 22:01:49 2011 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 53A24106566B; Thu, 14 Apr 2011 22:01:49 +0000 (UTC) (envelope-from gleb.kurtsou@gmail.com) Received: from mail-ww0-f50.google.com (mail-ww0-f50.google.com [74.125.82.50]) by mx1.freebsd.org (Postfix) with ESMTP id B2FEE8FC15; Thu, 14 Apr 2011 22:01:48 +0000 (UTC) Received: by wwc33 with SMTP id 33so2449905wwc.31 for ; Thu, 14 Apr 2011 15:01:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:date:from:to:cc:subject:message-id:references :mime-version:content-type:content-disposition:in-reply-to :user-agent; bh=q4nbrbGpF6ZA8uuJumyH96KShWm3plk6idjTHt2tYEs=; b=ZsHdY51boIFPmAi1gUcnEPM/gv9vmj+6TPlE4Nh/toU95pqCjstfwxUJkZSwt3KE3J 5EEfJ66/+Y9jNwYE7v5zm/W2kHdkdvRy8ZDU4FKh4XcrRY4LcRth6iCplCIJUjZJEMd+ vuBqKT6hAeNiDJak1GiR3pscZHw56MACe+0yA= 
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; b=gdYDIXlfG7D8di3IvAicbt0S/jz+hualC8HzNZpHKFcqnQqEfarH17/VUNOfb19h3x E47q2gUdendRuSo/aygM3LA16SNk707zPGREIcEPR2Lc8XGvXcu49K2xw1oxvGem54tn y6VfzQ2bcG9zekpamon/LmlPlImijUksZ/Efs= Received: by 10.227.173.141 with SMTP id p13mr1316816wbz.64.1302816972310; Thu, 14 Apr 2011 14:36:12 -0700 (PDT) Received: from localhost (lan-78-157-92-5.vln.skynet.lt [78.157.92.5]) by mx.google.com with ESMTPS id e13sm1241742wbi.23.2011.04.14.14.36.11 (version=SSLv3 cipher=OTHER); Thu, 14 Apr 2011 14:36:11 -0700 (PDT) Date: Fri, 15 Apr 2011 00:36:10 +0300 From: Gleb Kurtsou To: mdf@FreeBSD.org Message-ID: <20110414213610.GB92382@tops> References: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Cc: FreeBSD Arch Subject: Re: posix_fallocate(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Apr 2011 22:01:49 -0000 On (14/04/2011 12:35), mdf@FreeBSD.org wrote: > For work we need a functionality in our filesystem that is pretty much > like posix_fallocate(2), so we're using the name and I've added a > default VOP_ALLOCATE definition that does the right, but dumb, thing. > > The most recent mention of this function in FreeBSD was another thread > lamenting it's failure to exist: > http://lists.freebsd.org/pipermail/freebsd-ports/2010-February/059268.html > > The attached files are the core of the kernel implementation of the > syscall and a default VOP for any filesystem not supporting > VOP_ALLOCATE, which allows the syscall to work as expected but in a > non-performant manner. 
I didn't see this syscall in NetBSD or
> OpenBSD, so I plan to add it to the end of our syscall table.
>
> What I wanted to check with -arch about was:
>
> 1) is there still a desire for this syscall?

It does not look like it plays well architecturally with modern COW file systems like ZFS and HAMMER. So potentially it can be implemented only for UFS.

> 2) is this naive implementation useful enough to serve as a default
> for all filesystems until someone with more knowledge fills them in?

The mailing list ate the patch; only the man page is attached.

> 3) are there any obvious bugs or missing elements?
>
> Thanks,
> matthew
> _______________________________________________
> freebsd-arch@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-arch
> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"

From owner-freebsd-arch@FreeBSD.ORG Thu Apr 14 22:41:32 2011 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 98F70106566B; Thu, 14 Apr 2011 22:41:32 +0000 (UTC) (envelope-from mdf356@gmail.com) Received: from mail-wy0-f182.google.com (mail-wy0-f182.google.com [74.125.82.182]) by mx1.freebsd.org (Postfix) with ESMTP id 08D118FC15; Thu, 14 Apr 2011 22:41:31 +0000 (UTC) Received: by wyf23 with SMTP id 23so2177724wyf.13 for ; Thu, 14 Apr 2011 15:41:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=sJlFQlEaz2BWsKmn00+seKx0+qvIhWtlCbWauJsHf/c=; b=LdmCbq4UzM3D0Qp9DsjQuyK7wKfRTnbTeIPi0YOAa+bE5sygrvdcqskO0BSap+y76l yVK2SQWeD3+sPDxsgctbwDP+DTZtQ2mpWmsfRr79/yCzh/sNlk54lsclhrawModwAYak AoB6VDYXzrD33519RyJ/TTicDIMYSosS5J+Wk= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date
:x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; b=fVPCL9CqxiAhwDVU5/h+SRZ6QkpZnsSmnvAt8qqG/oi7ATWJk91yGlyap29yyCOzpy jWzpj63UCtHdKRoqNhJjouQc5A/W9I5nhnj3pArRSTRj/dRx2uW4EIBJqUj0Cdoi+0z/ k/qDrud+8lgy2i9bi8VA/TGDxJWPY7mUhABTo= MIME-Version: 1.0 Received: by 10.216.254.142 with SMTP id h14mr1335515wes.31.1302820890662; Thu, 14 Apr 2011 15:41:30 -0700 (PDT) Sender: mdf356@gmail.com Received: by 10.216.123.15 with HTTP; Thu, 14 Apr 2011 15:41:30 -0700 (PDT) In-Reply-To: <20110414213610.GB92382@tops> References: <20110414213610.GB92382@tops> Date: Thu, 14 Apr 2011 15:41:30 -0700 X-Google-Sender-Auth: WGdARzJ-7PlNT0Ql2IBqAAl84lU Message-ID: From: mdf@FreeBSD.org To: Gleb Kurtsou , John Baldwin Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: FreeBSD Arch Subject: Re: posix_fallocate(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Apr 2011 22:41:32 -0000 On Thu, Apr 14, 2011 at 2:36 PM, Gleb Kurtsou wrot= e: > On (14/04/2011 12:35), mdf@FreeBSD.org wrote: >> For work we need a functionality in our filesystem that is pretty much >> like posix_fallocate(2), so we're using the name and I've added a >> default VOP_ALLOCATE definition that does the right, but dumb, thing. >> >> The most recent mention of this function in FreeBSD was another thread >> lamenting it's failure to exist: >> http://lists.freebsd.org/pipermail/freebsd-ports/2010-February/059268.ht= ml >> >> The attached files are the core of the kernel implementation of the >> syscall and a default VOP for any filesystem not supporting >> VOP_ALLOCATE, which allows the syscall to work as expected but in a >> non-performant manner. =A0I didn't see this syscall in NetBSD or >> OpenBSD, so I plan to add it to the end of our syscall table. 
>> >> What I wanted to check with -arch about was: >> >> 1) is there still a desire for this syscall? > It looks not to play well architecturally with modern COW file systems > like ZFS and HUMMER. So potentially it can be implemented only for UFS. The syscall, or the dumb implementation? I don't see why the syscall itself would be a problem; presumably ZFS can figure out whether an fallocate() block is worth COWing or not... >> 2) is this naive implementation useful enough to serve as a default >> for all filesystems until someone with more knowledge fills them in? > Maillist ate the patch. Only man page attached. Whoops! http://people.freebsd.org/~mdf/bsd-fallocate.diff Cheers, matthew From owner-freebsd-arch@FreeBSD.ORG Fri Apr 15 02:26:48 2011 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 403ED106564A; Fri, 15 Apr 2011 02:26:48 +0000 (UTC) (envelope-from alan.l.cox@gmail.com) Received: from mail-fx0-f54.google.com (mail-fx0-f54.google.com [209.85.161.54]) by mx1.freebsd.org (Postfix) with ESMTP id 9F0B08FC08; Fri, 15 Apr 2011 02:26:47 +0000 (UTC) Received: by fxm11 with SMTP id 11so2145442fxm.13 for ; Thu, 14 Apr 2011 19:26:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:reply-to:in-reply-to:references :date:message-id:subject:from:to:cc:content-type; bh=CdbjE1fRHu0QKk7BZwTkgIFAcBTVysEDnidonJJlTys=; b=S9zihSSnfVjTZBFQXmnSemx6+t0IT2qqmbnsUJAL0O3srWRdpKu4AYifVH1sMiBOgr RpzpxUblYsYhPvUaMI9jrwr0zO1BZbD2qYhQs80k39FbjI+F873LvV9zmFX+/HqYsetv o+WOOd+SQEtBk0VPNrEbohkutvmkFpEhfFBbg= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:reply-to:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; b=ECo6AX6qH0X5I+yxAt5DdqDMu/2c8a7raxpAjSoEKe+XVD3QTKFAhIWmM9hxySWd8T 
UeHz9NwCXidlZZnTEQMQOszsIWL0YpkzMS4VQI2tjGRX67E2bAAMIIMpo/UHrRSmNahN 30nrI5kljDLlyWvJHrzlQOhbz5U9Ev2z7flt8= MIME-Version: 1.0 Received: by 10.223.58.72 with SMTP id f8mr1489313fah.137.1302832724872; Thu, 14 Apr 2011 18:58:44 -0700 (PDT) Received: by 10.223.89.143 with HTTP; Thu, 14 Apr 2011 18:58:44 -0700 (PDT) In-Reply-To: References: Date: Thu, 14 Apr 2011 20:58:44 -0500 Message-ID: From: Alan Cox To: mdf@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Cc: FreeBSD Arch Subject: Re: posix_fallocate(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: alc@freebsd.org List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Apr 2011 02:26:48 -0000 On Thu, Apr 14, 2011 at 2:35 PM, wrote: > For work we need a functionality in our filesystem that is pretty much > like posix_fallocate(2), so we're using the name and I've added a > default VOP_ALLOCATE definition that does the right, but dumb, thing. > > The most recent mention of this function in FreeBSD was another thread > lamenting it's failure to exist: > http://lists.freebsd.org/pipermail/freebsd-ports/2010-February/059268.html > > The attached files are the core of the kernel implementation of the > syscall and a default VOP for any filesystem not supporting > VOP_ALLOCATE, which allows the syscall to work as expected but in a > non-performant manner. I didn't see this syscall in NetBSD or > OpenBSD, so I plan to add it to the end of our syscall table. > > What I wanted to check with -arch about was: > > 1) is there still a desire for this syscall? > Page 10 of my paper at http://www.jeffshafer.com/publications/papers/shafer_ispass10.pdf describes how it could improve Hadoop performance (if properly implemented). So, I would encourage you to add it. 
Alan From owner-freebsd-arch@FreeBSD.ORG Fri Apr 15 09:31:02 2011 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4A50110656DD; Fri, 15 Apr 2011 09:31:02 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id D75A28FC21; Fri, 15 Apr 2011 09:31:01 +0000 (UTC) Received: from deviant.kiev.zoral.com.ua (root@deviant.kiev.zoral.com.ua [10.1.1.148]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id p3F9UvcQ012106 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 15 Apr 2011 12:30:57 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.4/8.14.4) with ESMTP id p3F9UvmR090140; Fri, 15 Apr 2011 12:30:57 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.4/8.14.4/Submit) id p3F9UvRi090139; Fri, 15 Apr 2011 12:30:57 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Fri, 15 Apr 2011 12:30:57 +0300 From: Kostik Belousov To: mdf@freebsd.org Message-ID: <20110415093057.GJ48734@deviant.kiev.zoral.com.ua> References: <20110414213610.GB92382@tops> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="HuXIgs6JvY9hJs5C" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-3.4 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00, DNS_FROM_OPENWHOIS autolearn=no version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: Gleb Kurtsou , FreeBSD Arch 
Subject: Re: posix_fallocate(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Apr 2011 09:31:02 -0000 --HuXIgs6JvY9hJs5C Content-Type: text/plain; charset=koi8-r Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Apr 14, 2011 at 03:41:30PM -0700, mdf@freebsd.org wrote: > On Thu, Apr 14, 2011 at 2:36 PM, Gleb Kurtsou wr= ote: > > On (14/04/2011 12:35), mdf@FreeBSD.org wrote: > >> For work we need a functionality in our filesystem that is pretty much > >> like posix_fallocate(2), so we're using the name and I've added a > >> default VOP_ALLOCATE definition that does the right, but dumb, thing. > >> > >> The most recent mention of this function in FreeBSD was another thread > >> lamenting it's failure to exist: > >> http://lists.freebsd.org/pipermail/freebsd-ports/2010-February/059268.= html > >> > >> The attached files are the core of the kernel implementation of the > >> syscall and a default VOP for any filesystem not supporting > >> VOP_ALLOCATE, which allows the syscall to work as expected but in a > >> non-performant manner. =9AI didn't see this syscall in NetBSD or > >> OpenBSD, so I plan to add it to the end of our syscall table. > >> > >> What I wanted to check with -arch about was: > >> > >> 1) is there still a desire for this syscall? > > It looks not to play well architecturally with modern COW file systems > > like ZFS and HUMMER. So potentially it can be implemented only for UFS. >=20 > The syscall, or the dumb implementation? I don't see why the syscall > itself would be a problem; presumably ZFS can figure out whether an > fallocate() block is worth COWing or not... >=20 > >> 2) is this naive implementation useful enough to serve as a default > >> for all filesystems until someone with more knowledge fills them in? 
> > Maillist ate the patch. Only man page attached.
>
> Whoops!
>
> http://people.freebsd.org/~mdf/bsd-fallocate.diff

New syscall symbols for 9.0 should go in under the FBSD_1.2 version, not
FBSD_1.0.

You have inconsistent spacing in kern_posix_fallocate().

I do not quite understand the vnode locking you did. You marked the vop
as taking and returning an unlocked vnode, but you call VOP_GETATTR in
the vop std implementation before locking the vnode. Did you test with
the DEBUG_VFS_LOCKS config?

The usual (and proper) practice is to have such a vop require a locked
vnode; in the case of VOP_ALLOCATE, an exclusive lock is appropriate.
The Giant dance and vn_start_write() + vn_lock() then go into
kern_posix_fallocate(). Also, you should call bwillwrite() before
taking any vfs locks.

Is the locking/unlocking of the vnode in the loop done to allow other
callers to perform i/o on the vnode in between? In particular, to
truncate it? I think this is not needed, and the previous suggestion
would take care of it.

Why do you need stdallocate_extend()? VOP_WRITE does the right thing
with extending the vnode.

You might find vn_rdwr easier to use than the bare vops. In particular,
it would not omit the MAC calls for read/write.
--HuXIgs6JvY9hJs5C Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (FreeBSD) iEYEARECAAYFAk2oEFEACgkQC3+MBN1Mb4hjtgCgg7uxoPSepR7JPHDkdqaZUGrp 0pkAoOz8XPQ6Rtdju8bnj7JKGhnOliDi =Q8QX -----END PGP SIGNATURE----- --HuXIgs6JvY9hJs5C-- From owner-freebsd-arch@FreeBSD.ORG Fri Apr 15 10:54:16 2011 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 41754106566B; Fri, 15 Apr 2011 10:54:16 +0000 (UTC) (envelope-from gleb.kurtsou@gmail.com) Received: from mail-ww0-f50.google.com (mail-ww0-f50.google.com [74.125.82.50]) by mx1.freebsd.org (Postfix) with ESMTP id 74FD98FC13; Fri, 15 Apr 2011 10:54:14 +0000 (UTC) Received: by wwc33 with SMTP id 33so2935676wwc.31 for ; Fri, 15 Apr 2011 03:54:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:date:from:to:cc:subject:message-id:references :mime-version:content-type:content-disposition :content-transfer-encoding:in-reply-to:user-agent; bh=8dcHTGAT71oEefxStGUOdY4ienvaZPZyBkQILsZp/i4=; b=ex/0itZLPpUPNEGm4L+8nYleb/2Ist30TAHs4lm5A6rx5T8xZ4gsZOaKAbscf9+QFR rR1XhOXxiuVD5XLRqPHGwWOMvpnB/YJKuAmzY/Niglgp4oIecCt2/7MWASCorQVg7SBy ylOf8EXk0bxqbWM1fL8unaFTfyTYR3tb6ikBo= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:content-transfer-encoding :in-reply-to:user-agent; b=uDvCp3fMkfXk34H+rUDK7CfwaNz8eMjCX2tgObHOLW53gAZVh72Ze7Ypzv1+VOnyA5 DHJcUchHwr6ZVhi9oMUpqF9LiAQpYr4X0JTbXS/C0RQElmwaEm36aArMPtnOUOF41ScP xq22GIQt2hV2nISLOY7VSuKq8WhS60D6qMrrw= Received: by 10.227.0.140 with SMTP id 12mr1915517wbb.122.1302864854218; Fri, 15 Apr 2011 03:54:14 -0700 (PDT) Received: from localhost (lan-78-157-92-5.vln.skynet.lt [78.157.92.5]) by mx.google.com with ESMTPS id w12sm1537419wby.24.2011.04.15.03.54.12 
(version=SSLv3 cipher=OTHER); Fri, 15 Apr 2011 03:54:13 -0700 (PDT) Date: Fri, 15 Apr 2011 13:54:09 +0300 From: Gleb Kurtsou To: mdf@FreeBSD.org Message-ID: <20110415105409.GA14344@tops> References: <20110414213610.GB92382@tops> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Cc: FreeBSD Arch Subject: Re: posix_fallocate(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Apr 2011 10:54:16 -0000 On (14/04/2011 15:41), mdf@FreeBSD.org wrote: > On Thu, Apr 14, 2011 at 2:36 PM, Gleb Kurtsou wrote: > > On (14/04/2011 12:35), mdf@FreeBSD.org wrote: > >> For work we need a functionality in our filesystem that is pretty much > >> like posix_fallocate(2), so we're using the name and I've added a > >> default VOP_ALLOCATE definition that does the right, but dumb, thing. > >> > >> The most recent mention of this function in FreeBSD was another thread > >> lamenting it's failure to exist: > >> http://lists.freebsd.org/pipermail/freebsd-ports/2010-February/059268.html > >> > >> The attached files are the core of the kernel implementation of the > >> syscall and a default VOP for any filesystem not supporting > >> VOP_ALLOCATE, which allows the syscall to work as expected but in a > >> non-performant manner.  I didn't see this syscall in NetBSD or > >> OpenBSD, so I plan to add it to the end of our syscall table. > >> > >> What I wanted to check with -arch about was: > >> > >> 1) is there still a desire for this syscall? > > It looks not to play well architecturally with modern COW file systems > > like ZFS and HUMMER. So potentially it can be implemented only for UFS. > > The syscall, or the dumb implementation? 
I don't see why the syscall
> itself would be a problem; presumably ZFS can figure out whether an
> fallocate() block is worth COWing or not...
It is good to have if there is a chance of getting a real implementation
for UFS. Having only a dumb implementation will fool user software into
thinking that we support it.

As far as I understand, ZFS caches a large chunk of changes and then
writes all of them at once. I doubt blocks can be preallocated. You
preallocate a block, it is marked as used in the file system's metadata,
and the metadata changes are written to disk; this results in an
inconsistency, because the preallocated block is marked as "used" in the
metadata and thus can't be overwritten. I might be absolutely wrong; ZFS
experts are better placed to answer this. Grepping reveals no fallocate
support in ZFS.

> >> 2) is this naive implementation useful enough to serve as a default
> >> for all filesystems until someone with more knowledge fills them in?
> > Maillist ate the patch. Only man page attached.
>
> Whoops!
>
> http://people.freebsd.org/~mdf/bsd-fallocate.diff
What was the performance impact on copying large files?

I had sparse file support in PEFS implemented in a similar way.
Performance was terrible: the vm and buf caches were saturated first by
writing huge chunks of zeros and then by mmap'ing and writing the actual
data. sched_yield() and/or vnode lock/unlock didn't improve interactive
performance either.

Why wouldn't you just call VOP_SETATTR(newsize) in the dumb
implementation? File systems expect such behavior from files; cp has
been using mmap for a while already.
> > Cheers, > matthew From owner-freebsd-arch@FreeBSD.ORG Fri Apr 15 12:36:40 2011 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 23AED106566B; Fri, 15 Apr 2011 12:36:40 +0000 (UTC) (envelope-from jilles@stack.nl) Received: from mx1.stack.nl (relay04.stack.nl [IPv6:2001:610:1108:5010::107]) by mx1.freebsd.org (Postfix) with ESMTP id B5E008FC17; Fri, 15 Apr 2011 12:36:39 +0000 (UTC) Received: from turtle.stack.nl (turtle.stack.nl [IPv6:2001:610:1108:5010::132]) by mx1.stack.nl (Postfix) with ESMTP id 7D32C1DD9D8; Fri, 15 Apr 2011 14:36:38 +0200 (CEST) Received: by turtle.stack.nl (Postfix, from userid 1677) id 6F01F17376; Fri, 15 Apr 2011 14:36:38 +0200 (CEST) Date: Fri, 15 Apr 2011 14:36:38 +0200 From: Jilles Tjoelker To: Kostik Belousov Message-ID: <20110415123638.GA79988@stack.nl> References: <20110414213610.GB92382@tops> <20110415093057.GJ48734@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20110415093057.GJ48734@deviant.kiev.zoral.com.ua> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: mdf@freebsd.org, Gleb Kurtsou , FreeBSD Arch Subject: Re: posix_fallocate(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Apr 2011 12:36:40 -0000 On Fri, Apr 15, 2011 at 12:30:57PM +0300, Kostik Belousov wrote: > You might find vn_rdwr easier to use then the bare vops. In particular, > it would not omit the mac calls for read/write. I think omitting the MAC call for read is how it should be. The application does not read any data, the read is just the only way to force allocation without destroying existing data. posix_fallocate() should work for write-only files. 
-- Jilles Tjoelker From owner-freebsd-arch@FreeBSD.ORG Fri Apr 15 14:20:03 2011 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CD5B6106566C for ; Fri, 15 Apr 2011 14:20:03 +0000 (UTC) (envelope-from pawel@dawidek.net) Received: from mail.garage.freebsd.pl (60.wheelsystems.com [83.12.187.60]) by mx1.freebsd.org (Postfix) with ESMTP id 641438FC0A for ; Fri, 15 Apr 2011 14:20:02 +0000 (UTC) Received: by mail.garage.freebsd.pl (Postfix, from userid 65534) id 0947145FA5; Fri, 15 Apr 2011 16:20:01 +0200 (CEST) Received: from localhost (89-73-195-149.dynamic.chello.pl [89.73.195.149]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.garage.freebsd.pl (Postfix) with ESMTP id E083C45C8A; Fri, 15 Apr 2011 16:19:55 +0200 (CEST) Date: Fri, 15 Apr 2011 16:19:46 +0200 From: Pawel Jakub Dawidek To: Gleb Kurtsou Message-ID: <20110415141946.GB4526@garage.freebsd.pl> References: <20110414213610.GB92382@tops> <20110415105409.GA14344@tops> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="KFztAG8eRSV9hGtP" Content-Disposition: inline In-Reply-To: <20110415105409.GA14344@tops> X-OS: FreeBSD 9.0-CURRENT amd64 User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail.garage.freebsd.pl X-Spam-Level: X-Spam-Status: No, score=-0.6 required=4.5 tests=BAYES_00,RCVD_IN_SORBS_DUL autolearn=no version=3.0.4 Cc: mdf@FreeBSD.org, FreeBSD Arch Subject: Re: posix_fallocate(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Apr 2011 14:20:03 -0000 --KFztAG8eRSV9hGtP Content-Type: text/plain; charset=us-ascii Content-Disposition: inline 
Content-Transfer-Encoding: quoted-printable

On Fri, Apr 15, 2011 at 01:54:09PM +0300, Gleb Kurtsou wrote:
[...]
> As far as I understand ZFS caches large chunk of changes and than writes
> all of them at once. I doubt blocks can be preallocated. You preallocate
> block, it's marked as used in file systems meta data, changes to meta
> data are written to disk -- it results in inconsistency because
> preallocated block is marked as "used" in meta data and thus can't
> be overwritten. I might be absolutely wrong, ZFS experts are
> better answer this. Grepping reveals no fallocate support in ZFS.
[...]
> Why wouldn't you just call VOP_SETATTR(newsize) in dumb implementation.
> File systems expect files such behavior, cp is using mmap for a while
> already.

The idea behind posix_fallocate(2) is to guarantee that there will be
enough space for future writes. It does make sense for ZFS too, because
it will at least reserve space. VOP_SETATTR(newsize) won't reserve any
space, it will just create a hole.

-- 
Pawel Jakub Dawidek http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am!
http://yomoli.com --KFztAG8eRSV9hGtP Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.14 (FreeBSD) iEYEARECAAYFAk2oVAIACgkQForvXbEpPzTQigCgrM8GOvos6Ln/TyXJe/wH+GN7 FnEAoMJrDO3tRM6sNLZOmSoZedohVZN5 =+FhW -----END PGP SIGNATURE----- --KFztAG8eRSV9hGtP-- From owner-freebsd-arch@FreeBSD.ORG Fri Apr 15 14:31:47 2011 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2A1581065670; Fri, 15 Apr 2011 14:31:47 +0000 (UTC) (envelope-from pawel@dawidek.net) Received: from mail.garage.freebsd.pl (60.wheelsystems.com [83.12.187.60]) by mx1.freebsd.org (Postfix) with ESMTP id 8E6148FC14; Fri, 15 Apr 2011 14:31:46 +0000 (UTC) Received: by mail.garage.freebsd.pl (Postfix, from userid 65534) id 1532645CD9; Fri, 15 Apr 2011 16:31:45 +0200 (CEST) Received: from localhost (89-73-195-149.dynamic.chello.pl [89.73.195.149]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.garage.freebsd.pl (Postfix) with ESMTP id 473B245684; Fri, 15 Apr 2011 16:31:39 +0200 (CEST) Date: Fri, 15 Apr 2011 16:31:30 +0200 From: Pawel Jakub Dawidek To: mdf@FreeBSD.org Message-ID: <20110415143130.GC4526@garage.freebsd.pl> References: MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="iFRdW5/EC4oqxDHL" Content-Disposition: inline In-Reply-To: X-OS: FreeBSD 9.0-CURRENT amd64 User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail.garage.freebsd.pl X-Spam-Level: X-Spam-Status: No, score=-0.6 required=4.5 tests=BAYES_00,RCVD_IN_SORBS_DUL autolearn=no version=3.0.4 Cc: FreeBSD Arch Subject: Re: posix_fallocate(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: 
List-Subscribe: , X-List-Received-Date: Fri, 15 Apr 2011 14:31:47 -0000 --iFRdW5/EC4oqxDHL Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Apr 14, 2011 at 12:35:34PM -0700, mdf@FreeBSD.org wrote: > For work we need a functionality in our filesystem that is pretty much > like posix_fallocate(2), so we're using the name and I've added a > default VOP_ALLOCATE definition that does the right, but dumb, thing. >=20 > The most recent mention of this function in FreeBSD was another thread > lamenting it's failure to exist: > http://lists.freebsd.org/pipermail/freebsd-ports/2010-February/059268.html >=20 > The attached files are the core of the kernel implementation of the > syscall and a default VOP for any filesystem not supporting > VOP_ALLOCATE, which allows the syscall to work as expected but in a > non-performant manner. I didn't see this syscall in NetBSD or > OpenBSD, so I plan to add it to the end of our syscall table. >=20 > What I wanted to check with -arch about was: >=20 > 1) is there still a desire for this syscall? > 2) is this naive implementation useful enough to serve as a default > for all filesystems until someone with more knowledge fills them in? > 3) are there any obvious bugs or missing elements? As I understand it you have two cases to consider: 1. The caller wants to reserve space in region that might be a hole, so we read and rewrite this region. 2. The caller wants to reserve space beyond file size. We need to write zeros there. For the first case I don't see a point in rewriting the block if it contains data that are not all-zeros. Hole can contain only zeros, so there is a place for optimization right there - skip write step if data is not all-zeros. Of course you need to know somehow what smallest block size file system uses. In case of ZFS overwriting hole with zeros won't reserve the space if you have compression turned on. 
All-zeros are turned into holes by ZFS internally when compression is on. The first case would be better implemented using SEEK_HOLE/SEEK_DATA, but those are not implemented yet in UFS, but will allow to find holes in the file and just overwrite them. You could entirely avoid reading and most of the writes in general purpose implementation. You could also add a flag to VFS_SET(9) to mark file systems that support holes. If file system doesn't support holes, first case might be skipped. For the second case I find it as a waste to first extend file size and then read those zeros. Why can't you just write zeros and avoid read step when you are extending file? --=20 Pawel Jakub Dawidek http://www.wheelsystems.com FreeBSD committer http://www.FreeBSD.org Am I Evil? Yes, I Am! http://yomoli.com --iFRdW5/EC4oqxDHL Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.14 (FreeBSD) iEYEARECAAYFAk2oVsEACgkQForvXbEpPzRXjQCgx0wDZsaeZUugBi9+sjYN+M4T wf8An2GK/pVsFb+Db/WUIGcttkvEruIi =N2pF -----END PGP SIGNATURE----- --iFRdW5/EC4oqxDHL-- From owner-freebsd-arch@FreeBSD.ORG Fri Apr 15 16:22:19 2011 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B4C28106564A for ; Fri, 15 Apr 2011 16:22:19 +0000 (UTC) (envelope-from mdf356@gmail.com) Received: from mail-ww0-f50.google.com (mail-ww0-f50.google.com [74.125.82.50]) by mx1.freebsd.org (Postfix) with ESMTP id 449A08FC18 for ; Fri, 15 Apr 2011 16:22:18 +0000 (UTC) Received: by wwc33 with SMTP id 33so3296107wwc.31 for ; Fri, 15 Apr 2011 09:22:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=tp+ww0FhwtA4amyCGNy4/1LPqGngef6V2mn0F+GFx+4=; 
b=W3+RzrW3u3e68THWIpcXnV2zEGZhzu8Og77z7/86Fg2ujN/3lo3AWPjYXhH/MXwxlh GKQVwFWDa0B8+5OswtPVei+3IJ8pFfKIJSROi31DDdgHT/GASaNg0uTFrW/Elf2YTN2e hbp7pVLyhwuUl1HQYlzJQVajFGDRtr1xxG5JE= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; b=ldn+c3cr/Qq7WFlHYloLdXcq99uWXGeeeyhgst1G3q/x8Wg4lB2KVng0IhalQITI/6 TW8owauishw3UAn42WP4hOBs4n2JLr44d8V9r8qNtJpDS05AjqdBCKWvK8sCNkdPUfUj pg7Qly81yNoAYWwQ0Qv21QH2Fxd1ePLg9szCk= MIME-Version: 1.0 Received: by 10.216.64.139 with SMTP id c11mr7972800wed.46.1302884538071; Fri, 15 Apr 2011 09:22:18 -0700 (PDT) Sender: mdf356@gmail.com Received: by 10.216.123.15 with HTTP; Fri, 15 Apr 2011 09:22:18 -0700 (PDT) In-Reply-To: <20110415093057.GJ48734@deviant.kiev.zoral.com.ua> References: <20110414213610.GB92382@tops> <20110415093057.GJ48734@deviant.kiev.zoral.com.ua> Date: Fri, 15 Apr 2011 09:22:18 -0700 X-Google-Sender-Auth: EELFNNttp9t7on-371goGabOlh0 Message-ID: From: mdf@FreeBSD.org To: Kostik Belousov Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: FreeBSD Arch Subject: Re: posix_fallocate(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Apr 2011 16:22:19 -0000 2011/4/15 Kostik Belousov : > On Thu, Apr 14, 2011 at 03:41:30PM -0700, mdf@freebsd.org wrote: >> On Thu, Apr 14, 2011 at 2:36 PM, Gleb Kurtsou w= rote: >> > On (14/04/2011 12:35), mdf@FreeBSD.org wrote: >> >> For work we need a functionality in our filesystem that is pretty muc= h >> >> like posix_fallocate(2), so we're using the name and I've added a >> >> default VOP_ALLOCATE definition that does the right, but dumb, thing. 
>> >>
>> >> The most recent mention of this function in FreeBSD was another thread
>> >> lamenting it's failure to exist:
>> >> http://lists.freebsd.org/pipermail/freebsd-ports/2010-February/059268.html
>> >>
>> >> The attached files are the core of the kernel implementation of the
>> >> syscall and a default VOP for any filesystem not supporting
>> >> VOP_ALLOCATE, which allows the syscall to work as expected but in a
>> >> non-performant manner.  I didn't see this syscall in NetBSD or
>> >> OpenBSD, so I plan to add it to the end of our syscall table.
>> >>
>> >> What I wanted to check with -arch about was:
>> >>
>> >> 1) is there still a desire for this syscall?
>> > It looks not to play well architecturally with modern COW file systems
>> > like ZFS and HUMMER. So potentially it can be implemented only for UFS.
>>
>> The syscall, or the dumb implementation?  I don't see why the syscall
>> itself would be a problem; presumably ZFS can figure out whether an
>> fallocate() block is worth COWing or not...
>>
>> >> 2) is this naive implementation useful enough to serve as a default
>> >> for all filesystems until someone with more knowledge fills them in?
>> > Maillist ate the patch. Only man page attached.
>>
>> Whoops!
>>
>> http://people.freebsd.org/~mdf/bsd-fallocate.diff
>
> New syscall symbols for 9.0 should go in under FBSD_1.2 version, not FBSD_1.0.

Okay, fixed.

> You have inconsistent spacing in the kern_posix_fallocate().

Oops; copy/paste error; fixed.

> I do not quite understand the locking for vnode you did.
> You marked the vop as taking and returning unlocked vnode. But, you
> do call VOP_GETATTR in the vop std implementation before locking the vnode.
> Did you tested with DEBUG_VFS_LOCKS config ?

I have mostly tested on the version of FreeBSD we run at work which
has some small KPI modifications.  I will test and fix up on CURRENT
once I figure out prove(1).
As for locking:

(1) For $WORK, FreeBSD's locking of a "File" is problematic, since we
have both an inode lock and a data lock, and lots of times we don't
really need the inode locked exclusively, just the data, which we
handle inside the VOP.

(2) I don't want a 1TB allocation to be done in a single operation,
under a single lock, so the implementation is responsible for unlocking
and taking a breather as needed.

(3) I based the VOP_GETATTR on vn_stat, which calls VOP_GETATTR without
any lock.  Except, hmm, it looks like vn_statfile(9) takes the lock.  I
was trying to avoid a lock/unlock cycle when the file didn't need to
be extended, but I can put it back in.

> Usual (and proper) practice is to have such vop require locked vnode, in
> case of VOP_ALLOCATE, exclusive lock is appropriate. The Giant dance and
> vn_start_write() + vn_lock() go into kern_posix_fallocate() then.
> Also, you should call bwillwrite() before taking any vfs locks.
>
> Is locking/unlocking the vnode in loop is done to allow other callers
> to perform i/o on the vnode in between ? In particular, to truncate it ?
> I think this is not needed, and previous suggestion would take care of it.

See above; it is not acceptable in my mind to lock the vnode for the
entire length of the operation, so the locking is managed by the VOP.

> Why do you need stdallocate_extend() ? VOP_WRITE does the right thing
> with extending the vnode.

I was trying to simplify the implementation to an easy read/write loop,
since it isn't supposed to be performant but just to get the right data.
I could instead call VOP_GETATTR on each loop iteration to check the
file size and write zeros past the current file size, but that was more
logic than a single VOP_SETATTR followed by read/write.

> You might find vn_rdwr easier to use then the bare vops. In particular,
> it would not omit the mac calls for read/write.

I checked for write already in kern_posix_fallocate().  A single check
should be sufficient.
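The read/write loop discussed in this message can be sketched in userland
C. This is only an illustration under stated assumptions, not the code
from bsd-fallocate.diff: naive_fallocate() and NAIVE_CHUNK are
hypothetical names, ftruncate(2) stands in for the single VOP_SETATTR,
and pread(2)/pwrite(2) stand in for VOP_READ/VOP_WRITE.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>	/* open(2), for callers of this sketch */
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

#define NAIVE_CHUNK 65536	/* arbitrary I/O size, an assumption */

/*
 * Hypothetical userland analogue of the "dumb" default: extend the file
 * once if needed, then read each chunk of [offset, offset + len) and
 * write the same bytes back so the file system has to allocate storage.
 * Returns 0 or an errno value, in the style of posix_fallocate(2).
 */
int
naive_fallocate(int fd, off_t offset, off_t len)
{
	char buf[NAIVE_CHUNK];
	struct stat sb;
	off_t end;

	if (offset < 0 || len <= 0)
		return (EINVAL);
	end = offset + len;

	if (fstat(fd, &sb) == -1)
		return (errno);
	/* Only ever grow the file; never shrink it. */
	if (sb.st_size < end && ftruncate(fd, end) == -1)
		return (errno);

	while (offset < end) {
		off_t left = end - offset;
		size_t want = left < NAIVE_CHUNK ? (size_t)left : NAIVE_CHUNK;
		ssize_t n, done;

		n = pread(fd, buf, want, offset);
		if (n == -1)
			return (errno);
		if (n == 0)
			break;	/* file was truncated underneath us */
		assert(n <= (ssize_t)want);

		/* Write back exactly what was read, handling short writes. */
		for (done = 0; done < n; ) {
			ssize_t w = pwrite(fd, buf + done, (size_t)(n - done),
			    offset + done);
			if (w == -1)
				return (errno);
			done += w;
		}
		offset += n;
	}
	return (0);
}
```

Unlike a kernel implementation, this sketch holds no lock between the
read and the write, so a concurrent writer could see its data replaced
with stale bytes. It also needs a descriptor opened for reading, which,
per the discussion in this thread, the in-kernel default is meant to
avoid.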
For other threads, please note I don't know anything about UFS
implementations and I can't provide a ufs_allocate() that does rapid
allocation of logically zero blocks.  My intent is to provide the
framework, a default implementation that meets the spec'd behaviour,
and a set of testcases suitable to run for any filesystem that wants
to verify their implementation.

Thanks,
matthew

From owner-freebsd-arch@FreeBSD.ORG Fri Apr 15 16:28:03 2011
Return-Path:
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BC72A106566C for ; Fri, 15 Apr 2011 16:28:03 +0000 (UTC) (envelope-from mdf356@gmail.com)
Received: from mail-wy0-f182.google.com (mail-wy0-f182.google.com [74.125.82.182]) by mx1.freebsd.org (Postfix) with ESMTP id 4E1668FC08 for ; Fri, 15 Apr 2011 16:28:02 +0000 (UTC)
Received: by wyf23 with SMTP id 23so2882600wyf.13 for ; Fri, 15 Apr 2011 09:28:01 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:cc:content-type
	:content-transfer-encoding;
	bh=h3A/S9lGSg8QPpQQNdIRDD51o9bATN8LlRaedxcw0Yw=;
	b=W6TRCMYvw59cRDeIEf3I3DuifzWIG0jPFHUqlxlaYM3dAq9yz9ehZ5OiG8jia9LovY
	lyYH9rQUkwdV+PWX/7B2HuBt8qekqqj5Xt4tPSseqilwk/8ZNSv6g5C7bzJtxrbyvSmW
	GBZ6sO/VGxTl+qpRQWfUW5Cp34jz8prYXlG00=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:cc:content-type
	:content-transfer-encoding;
	b=mXqThXF2bWAtamtnJZY8OXZQCUz+MJp0DIXUHetGCgWtsQPQ/25gtgYl6ABziE6d0H
	RAdpHohCn4rvrgSU6Hie67kaXipa4LDsylCqrPvToaukJq9tsZ4bCyDdYHw3HrpmeGaD
	DqsXrkqLza/4EuCHpuesy8l99k0yzHGOtXwss=
MIME-Version: 1.0
Received: by 10.216.87.8 with SMTP id x8mr2205349wee.46.1302884881697; Fri, 15 Apr 2011 09:28:01 -0700 (PDT)
Sender: mdf356@gmail.com
Received: by
10.216.123.15 with HTTP; Fri, 15 Apr 2011 09:28:01 -0700 (PDT) In-Reply-To: <20110415105409.GA14344@tops> References: <20110414213610.GB92382@tops> <20110415105409.GA14344@tops> Date: Fri, 15 Apr 2011 09:28:01 -0700 X-Google-Sender-Auth: EbpFs5u0m462ZgcM4LfM5Kjor9o Message-ID: From: mdf@FreeBSD.org To: Gleb Kurtsou Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: FreeBSD Arch Subject: Re: posix_fallocate(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Apr 2011 16:28:03 -0000 On Fri, Apr 15, 2011 at 3:54 AM, Gleb Kurtsou wrot= e: > On (14/04/2011 15:41), mdf@FreeBSD.org wrote: >> On Thu, Apr 14, 2011 at 2:36 PM, Gleb Kurtsou w= rote: >> > On (14/04/2011 12:35), mdf@FreeBSD.org wrote: >> >> For work we need a functionality in our filesystem that is pretty muc= h >> >> like posix_fallocate(2), so we're using the name and I've added a >> >> default VOP_ALLOCATE definition that does the right, but dumb, thing. >> >> >> >> The most recent mention of this function in FreeBSD was another threa= d >> >> lamenting it's failure to exist: >> >> http://lists.freebsd.org/pipermail/freebsd-ports/2010-February/059268= .html >> >> >> >> The attached files are the core of the kernel implementation of the >> >> syscall and a default VOP for any filesystem not supporting >> >> VOP_ALLOCATE, which allows the syscall to work as expected but in a >> >> non-performant manner. =A0I didn't see this syscall in NetBSD or >> >> OpenBSD, so I plan to add it to the end of our syscall table. >> >> >> >> What I wanted to check with -arch about was: >> >> >> >> 1) is there still a desire for this syscall? >> > It looks not to play well architecturally with modern COW file systems >> > like ZFS and HUMMER. So potentially it can be implemented only for UFS= . 
>>
>> The syscall, or the dumb implementation?  I don't see why the syscall
>> itself would be a problem; presumably ZFS can figure out whether an
>> fallocate() block is worth COWing or not...
> It is good to have if there is a chance to get a real implementation for
> UFS. Having only dumb implementation will fool user software that we
> support it.
>
> As far as I understand ZFS caches large chunk of changes and than writes
> all of them at once. I doubt blocks can be preallocated. You preallocate
> block, it's marked as used in file systems meta data, changes to meta
> data are written to disk -- it results in inconsistency because
> preallocated block is marked as "used" in meta data and thus can't
> be overwritten. I might be absolutely wrong, ZFS experts are
> better answer this. Grepping reveals no fallocate support in ZFS.
>
>> >> 2) is this naive implementation useful enough to serve as a default
>> >> for all filesystems until someone with more knowledge fills them in?
>> > Maillist ate the patch. Only man page attached.
>>
>> Whoops!
>>
>> http://people.freebsd.org/~mdf/bsd-fallocate.diff
> What was performance impact on copying large files?

I don't know and I don't care. :-)

Specifically, one problem is that there is no file-system
implementation of "copy"; copy is implemented in userspace with
read(2) then write(2).  If the caller says posix_fallocate() then they
want blocks.  If copying a large file is slower after that, well, they
asked for it.

This implementation only meets the spec; it's not meant to be optimal.
An optimal VOP_WRITE() implementation may check that e.g. the next
block on write is all zero, and so will make a new logical-zero block
in the same manner as VOP_ALLOCATE.  This is up to each filesystem.

> I had sparse file support in PEFS implemented similar way.

posix_fallocate() is specifically to *not* have a sparse file.
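To make the sparse versus reserved distinction concrete, here is a small
userland sketch contrasting size-only growth with posix_fallocate(2).
The names grow_sparse(), grow_reserved() and struct grow_result are
hypothetical, and the block counts reported depend on the file system
(compression or a missing implementation may change them), so treat the
numbers as indicative only.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>	/* posix_fallocate(), open(2) */
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* What a file looks like after it has been grown. */
struct grow_result {
	off_t size;		/* st_size: logical length */
	long long blocks;	/* st_blocks: usually 512-byte units */
};

/* Fill *res from fstat(2); returns 0 or an errno value. */
static int
report(int fd, struct grow_result *res)
{
	struct stat sb;

	if (fstat(fd, &sb) == -1)
		return (errno);
	res->size = sb.st_size;
	res->blocks = (long long)sb.st_blocks;
	return (0);
}

/* Size-only growth: the new range is typically a hole with no storage. */
int
grow_sparse(int fd, off_t newsize, struct grow_result *res)
{
	assert(newsize >= 0);
	if (ftruncate(fd, newsize) == -1)
		return (errno);
	return (report(fd, res));
}

/* Reserving growth: ask the file system to back [0, newsize). */
int
grow_reserved(int fd, off_t newsize, struct grow_result *res)
{
	int error;

	assert(newsize > 0);
	/* posix_fallocate() returns an errno value directly, not -1. */
	error = posix_fallocate(fd, 0, newsize);
	if (error != 0)
		return (error);
	return (report(fd, res));
}
```

On a file system with hole support, grow_sparse() typically reports few
or no blocks for the new range, while a successful grow_reserved()
reports blocks covering it. Both leave the same logical size, which is
why checking st_size alone cannot tell the two apart.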
> Performance was terrible, vm > and buf caches where saturated first by writing huge chunks of zeros and > than by mmap'ing and writing actual data. sched_yeld() and/or vnode > lock/unlock didn't improve interactive performance either. > > Why wouldn't you just call VOP_SETATTR(newsize) in dumb implementation. > File systems expect files such behavior, cp is using mmap for a while > already. VOP_SETATTR(newsize) could truncate, if e.g. the file is already large and sparse and the fallocate(2) was to provide guaranteed storage only to the first 1MB. Thanks, matthew From owner-freebsd-arch@FreeBSD.ORG Fri Apr 15 16:42:43 2011 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1864D1065675; Fri, 15 Apr 2011 16:42:43 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id C99D18FC14; Fri, 15 Apr 2011 16:42:42 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id 63C1346B99; Fri, 15 Apr 2011 12:42:42 -0400 (EDT) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id F0DDA8A02B; Fri, 15 Apr 2011 12:42:41 -0400 (EDT) From: John Baldwin To: Kostik Belousov Date: Fri, 15 Apr 2011 12:32:59 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110325; KDE/4.5.5; amd64; ; ) References: <20110415093057.GJ48734@deviant.kiev.zoral.com.ua> In-Reply-To: <20110415093057.GJ48734@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Message-Id: <201104151232.59770.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.6 (bigwig.baldwin.cx); Fri, 15 Apr 2011 12:42:42 -0400 (EDT) Cc: mdf@freebsd.org, Gleb Kurtsou , FreeBSD Arch Subject: 
Re: posix_fallocate(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Apr 2011 16:42:43 -0000 On Friday, April 15, 2011 5:30:57 am Kostik Belousov wrote: > On Thu, Apr 14, 2011 at 03:41:30PM -0700, mdf@freebsd.org wrote: > > On Thu, Apr 14, 2011 at 2:36 PM, Gleb Kurtsou wrote: > > > On (14/04/2011 12:35), mdf@FreeBSD.org wrote: > > >> For work we need functionality in our filesystem that is pretty much > > >> like posix_fallocate(2), so we're using the name and I've added a > > >> default VOP_ALLOCATE definition that does the right, but dumb, thing. > > >> > > >> The most recent mention of this function in FreeBSD was another thread > > >> lamenting its failure to exist: > > >> http://lists.freebsd.org/pipermail/freebsd-ports/2010-February/059268.html > > >> > > >> The attached files are the core of the kernel implementation of the > > >> syscall and a default VOP for any filesystem not supporting > > >> VOP_ALLOCATE, which allows the syscall to work as expected but in a > > >> non-performant manner. I didn't see this syscall in NetBSD or > > >> OpenBSD, so I plan to add it to the end of our syscall table. > > >> > > >> What I wanted to check with -arch about was: > > >> > > >> 1) is there still a desire for this syscall? > > > It does not seem to play well architecturally with modern COW file systems > > > like ZFS and HAMMER. So potentially it can be implemented only for UFS. > > > > The syscall, or the dumb implementation? I don't see why the syscall > > itself would be a problem; presumably ZFS can figure out whether an > > fallocate() block is worth COWing or not... > > > > >> 2) is this naive implementation useful enough to serve as a default > > >> for all filesystems until someone with more knowledge fills them in? > > > Maillist ate the patch. Only man page attached. 
> > > > Whoops! > > > > http://people.freebsd.org/~mdf/bsd-fallocate.diff > > New syscall symbols for 9.0 should go in under FBSD_1.2 version, not FBSD_1.0. > > You have inconsistent spacing in the kern_posix_fallocate(). > > I do not quite understand the locking for vnode you did. > You marked the vop as taking and returning unlocked vnode. But, you > do call VOP_GETATTR in the vop std implementation before locking the vnode. > Did you test with the DEBUG_VFS_LOCKS config? > > Usual (and proper) practice is to have such vop require locked vnode, in > case of VOP_ALLOCATE, exclusive lock is appropriate. The Giant dance and > vn_start_write() + vn_lock() go into kern_posix_fallocate() then. > Also, you should call bwillwrite() before taking any vfs locks. > > Is locking/unlocking the vnode in the loop done to allow other callers > to perform i/o on the vnode in between? In particular, to truncate it? > I think this is not needed, and previous suggestion would take care of it. > > Why do you need stdallocate_extend() ? VOP_WRITE does the right thing > with extending the vnode. > > You might find vn_rdwr easier to use than the bare vops. In particular, > it would not omit the mac calls for read/write. I agree with pretty much all of this esp. as regards the locking, etc. -- John Baldwin
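For readers following the thread, here is a rough userland C sketch of what a "right, but dumb" allocation fallback could look like: existing bytes in the range are read and written back unchanged, and zeros are appended past end-of-file, so the file can grow but is never truncated (the concern raised above about a bare VOP_SETATTR(newsize)). This is not the code from bsd-fallocate.diff. The names naive_allocate and CHUNK are hypothetical, and the sketch ignores vnode locking, concurrent writers, and offset overflow.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

#define CHUNK 4096 /* hypothetical I/O unit for this sketch */

/*
 * Naively back the byte range [offset, offset + len) with storage.
 * Returns 0 on success or an errno value, like posix_fallocate(2).
 * Assumption: nobody else writes to the file while this runs.
 */
static int
naive_allocate(int fd, off_t offset, off_t len)
{
    char buf[CHUNK];
    struct stat st;
    off_t pos, end;

    if (offset < 0 || len <= 0)
        return (EINVAL);
    if (fstat(fd, &st) == -1)
        return (errno);
    end = offset + len; /* overflow is not checked in this sketch */

    for (pos = offset; pos < end; ) {
        size_t want = (size_t)(end - pos < CHUNK ? end - pos : CHUNK);
        ssize_t got, put;

        if (pos < st.st_size) {
            /* Inside the file: read the data so it is rewritten as-is. */
            if ((off_t)want > st.st_size - pos)
                want = (size_t)(st.st_size - pos);
            got = pread(fd, buf, want, pos);
            if (got == -1)
                return (errno);
            if (got == 0) {
                /* File is shorter than fstat() said; zero-fill from here. */
                st.st_size = pos;
                continue;
            }
            want = (size_t)got;
        } else {
            /* Past end-of-file: extend with zeros, never truncate. */
            memset(buf, 0, want);
        }

        put = pwrite(fd, buf, want, pos);
        if (put == -1)
            return (errno);
        if (put == 0)
            return (EIO);
        pos += put;
    }
    return (0);
}
```

A caller would use it the same way as the real interface, e.g. naive_allocate(fd, 0, 1024 * 1024) to guarantee storage for the first 1MB. Like posix_fallocate(), it reports failure through its return value rather than through errno.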