From owner-freebsd-fs@FreeBSD.ORG Sun Mar 10 02:42:47 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 209E187A; Sun, 10 Mar 2013 02:42:47 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id B9EF47B5; Sun, 10 Mar 2013 02:42:46 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqEEACryO1GDaFvO/2dsb2JhbABCiCi8JIF0dIIsAQEBAwEBAQEgBCcgCxsYAgINGQIpAQkmBggHBAEcBIdsBgyqI5FvgSOMOn00B4ItgRMDiHKLJYI+gR6PV4MoHjKBBTU X-IronPort-AV: E=Sophos;i="4.84,816,1355115600"; d="scan'208";a="20301161" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 09 Mar 2013 21:42:45 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id B9596B3F46; Sat, 9 Mar 2013 21:42:45 -0500 (EST) Date: Sat, 9 Mar 2013 21:42:45 -0500 (EST) From: Rick Macklem To: Garrett Wollman Message-ID: <663916089.3736429.1362883365710.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <20795.30884.330015.123616@hergotha.csail.mit.edu> Subject: Re: NFS DRC size MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.202] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692) Cc: freebsd-fs@freebsd.org, freebsd-net@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Mar 2013 02:42:47 -0000 Garrett Wollman wrote: > < said: > > > around the highwater mark basically indicates this is working. 
> > If it wasn't throwing away replies where the receipt has been ack'd
> > at the TCP level, the cache would grow very large, since they would
> > only be discarded after a loonnngg timeout (12 hours unless you've
> > changed NFSRVCACHE_TCPTIMEOUT in sys/fs/nfs/nfs.h).
>
> That seems unreasonably large.
>
I suppose. How long a network partitioning do you want the cache to deal
with? (My original design was trying to achieve a high level of
correctness by default.)

The only time cache entries normally hang around this long is when a
client has dismounted the volume(s) using the TCP connection. The cached
replies for the last few requests will then hang around until the
timeout. For a few clients this isn't an issue. For 2,000 clients, I can
see that it might be, if the clients choose to dismount volumes (using
something like amd).

Feel free to make it smaller, based on the longest network partitioning
that you anticipate might occur.

> > Well, the DRC will try to cache replies until the client's TCP layer
> > acknowledges receipt of the reply. It is hard to say how many replies
> > that is for a given TCP connection, since it is a function of the
> > level of concurrency (# of nfsiod threads in the FreeBSD client)
> > in the client. I'd guess it's somewhere between 1<->20?
>
> Nearly all our clients are Linux, so it's likely to be whatever Debian
> does by default.
>
> > Multiply that by the number of TCP connections from all clients and
> > you have about how big the server's DRC will be. (Some clients use
> > a single TCP connection for the client whereas others use a separate
> > TCP connection for each mount point.)
>
> The Debian client appears to use a single TCP connection for
> everything.
>
> So if I want to support 2,000 clients each with 20 requests in flight,
> that would suggest that I need a DRC size of 40,000, which my
> experience shows is not sufficient with even a much smaller number of
> clients.
> Well, especially since Debian is using one TCP connection for everything from a client, the guess of 20 could be way low. rick > -GAWollman > > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" From owner-freebsd-fs@FreeBSD.ORG Sun Mar 10 05:13:31 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 995D65AB for ; Sun, 10 Mar 2013 05:13:31 +0000 (UTC) (envelope-from cr@caltel.com) Received: from mail2.caltel.com (mail2.caltel.com [66.102.145.6]) by mx1.freebsd.org (Postfix) with ESMTP id 82C84D10 for ; Sun, 10 Mar 2013 05:13:31 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ap4EAJQVPFFCZpCq/2dsb2JhbABDwW6EVXSCKwEBBDlGNzQCWQgBAYgJBptdoDySVQOIco1jhWeLDoMrGw X-IPAS-Result: Ap4EAJQVPFFCZpCq/2dsb2JhbABDwW6EVXSCKwEBBDlGNzQCWQgBAYgJBptdoDySVQOIco1jhWeLDoMrGw X-IronPort-AV: E=Sophos;i="4.84,817,1355126400"; d="scan'208";a="775349" Received: from host-170.a66-102-144.caltel.com (HELO codys-mac.local) ([66.102.144.170]) by smtp.caltel.com with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA; 09 Mar 2013 21:12:10 -0800 Message-ID: <513C1629.50501@caltel.com> Date: Sat, 09 Mar 2013 21:12:09 -0800 From: Cody Ritts Organization: CalTel User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:17.0) Gecko/20130216 Thunderbird/17.0.3 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Aligning MBR for ZFS boot help Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Mar 2013 05:13:31 -0000 Hello all, I am really struggling to understand what is going 
on. If anyone could tell me where I am going wrong, I would greatly
appreciate it.

I have a new Intel Atom appliance that will not boot from a GPT partition
table. It came with an SSD, so I am trying to align it to 1MB for the
erase block size.

All of these commands are being run from a 9.1-RELEASE-amd64-memstick.

These commands partition the drive, and the system boots just fine:

> gpart create -s mbr ada0
> gpart add -t freebsd ada0
> gpart create -s bsd ada0s1
> gpart add -s 52862M -t freebsd-zfs ada0s1
> gpart add -s 8G -t freebsd-swap ada0s1
> gpart set -a active -i 1 ada0
> gpart bootcode -b /boot/mbr ada0
> dd if=/boot/zfsboot of=/dev/ada0s1 count=1
> dd if=/boot/zfsboot of=/dev/ada0s1a skip=1 seek=1024

This is the gpart print output of those commands:

> =>        63  125045361  ada0  MBR  (59G)
>           63  125045361     1  freebsd  [active]  (59G)
>
> =>         0  125045361  ada0s1  BSD  (59G)
>            0  108261376       1  freebsd-zfs  (51G)
>    108261376   16777216       2  freebsd-swap  (8.0G)
>    125038592       6769          - free -  (3.3M)

Here is my disk info:

> root@:/root # diskinfo -v ada0
> ada0
>         512             # sectorsize
>         64023257088     # mediasize in bytes (59G)
>         125045424       # mediasize in sectors
>         0               # stripesize
>         0               # stripeoffset
>         124053          # Cylinders according to firmware.
>         16              # Heads according to firmware.
>         63              # Sectors according to firmware.

125045361 + 63 = 125045424
So gpart is for sure printing sectors.

freebsd-zfs starts at sector 63.
So, I need that freebsd-zfs slice to start at 1MB.
1MB = 2048s
2048 - 63 = 1985

So if I add an offset to my slice:

> gpart add -b 1985 -s 52862M -t freebsd-zfs ada0s1

it should start me at 2048.

> =>        63  125045361  ada0  MBR  (59G)
>           63  125045361     1  freebsd  [active]  (59G)
> =>         0  125045361  ada0s1  BSD  (59G)
>            0       1985          - free -  (992k)
>         1985  108261376       1  freebsd-zfs  (51G)

BUT, when I boot, I get this:

> zfsboot: No ZFS Pools located, can't boot

I think I remember reading that freebsd-zfs had to be the first slice (I
cannot remember where I read that). And it apparently does not think an
offset is funny.
So, that leaves me with trying to adjust my MBR partition, so I start
over and run:

> gpart add -b 1985 -t freebsd ada0

but that gives me:

> =>        63  125045361  ada0  MBR  (59G)
>           63       1953        - free -  (976k)
>         2016  125043408     1  freebsd  (59G)

HHHMMMMM????
Well, 2016 - 1953 = 63. Coincidence? I doubt it, but I don't get it.

Poking around on the internet, it looks like gpart is possibly enforcing
geometry boundaries? So I do the following:

> sysctl kern.geom.part.check_integrity=0
> root@:/root # gpart add -a 1m -t freebsd ada0
> ada0s1 added
> root@:/root # gpart show
> =>        63  125045361  ada0  MBR  (59G)
>           63       2016        - free -  (1M)
>         2079  125042652     1  freebsd  (59G)
>    125044731        693        - free -  (346k)

Obviously that still didn't work.

I try a 10MB offset.
10MB = 20480s
20480 - 63 = 20417s

> gpart add -b 20417 -t freebsd ada0
> =>        63  125045361  ada0  MBR  (59G)
>           63      20412        - free -  (10M)
>        20475  125024949     1  freebsd  (59G)

It is still just a few sectors off. So what if I let gpart automatically
align it?

> gpart add -a 1m -t freebsd ada0
> =>        63  125045361  ada0  MBR  (59G)
>           63       2016        - free -  (1M)
>         2079  125042652     1  freebsd  (59G)
>    125044731        693        - free -  (346k)

And 2079 is still != 2048.

I have tried adjusting those numbers one by one, and it just hops around
the number I am looking for. I have tried adding partitions in front of
it, setting the alignment to 1s, and adjusting the size. I cannot get it
to land on 2048.

It does boot with the padding in the MBR table, but I don't think it is
aligned. Maybe it is aligned, and I just don't know any better.

I am at a loss. Any suggestions would be greatly appreciated.
Thanks, Cody From owner-freebsd-fs@FreeBSD.ORG Sun Mar 10 06:12:24 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 001A0F1B for ; Sun, 10 Mar 2013 06:12:23 +0000 (UTC) (envelope-from jdavidlists@gmail.com) Received: from mail-ia0-x234.google.com (mail-ia0-x234.google.com [IPv6:2607:f8b0:4001:c02::234]) by mx1.freebsd.org (Postfix) with ESMTP id B82EEE39 for ; Sun, 10 Mar 2013 06:12:23 +0000 (UTC) Received: by mail-ia0-f180.google.com with SMTP id f27so2641701iae.39 for ; Sat, 09 Mar 2013 22:12:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=w6/rYYWSyXwzz89OE/o5rZnnfZ/WdbZVvdodMSfphAQ=; b=INB4eZmwilgwy7KnF1pJpESNXnW+/q3M8nhKnNH1Q54bcDG3dm7DfrxFkz9PZYHEeK iYe4AwHa8vooJp0DdLGHOamvYLTHRH8HPmpdPLayrMN8fU7XbGA12rQ2u3VStNZsCVkq g66DSBYUoVs4AHJU8wp/h5INoAa2wdhYviotWLekYg5ueammOhqcgHIXi2GwT2EtjkiD uGwev1g2ojX2R+3ZYlBvlvuva4Ik3nxqSM9N6b8LOnrwfRAwz58cbJ2OIdut+MvJHtzG nyzCss7jDBHn4aPnEZW7/4VHssLZBidhhUFrKGh5481PL6WdkQBmHSoVT++lDC0cpOJt 2w6A== MIME-Version: 1.0 X-Received: by 10.50.57.166 with SMTP id j6mr4152902igq.21.1362895943253; Sat, 09 Mar 2013 22:12:23 -0800 (PST) Sender: jdavidlists@gmail.com Received: by 10.42.153.133 with HTTP; Sat, 9 Mar 2013 22:12:23 -0800 (PST) In-Reply-To: <513C1629.50501@caltel.com> References: <513C1629.50501@caltel.com> Date: Sun, 10 Mar 2013 01:12:23 -0500 X-Google-Sender-Auth: 7XRxo5UqAysIqAivjxj_0tEH3xQ Message-ID: Subject: Re: Aligning MBR for ZFS boot help From: J David To: Cody Ritts Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: 
List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Mar 2013 06:12:24 -0000

On Sun, Mar 10, 2013 at 12:12 AM, Cody Ritts wrote:

> I have a new intel atom appliance that will not boot from a GPT partition
> table. It came with an SSD, so I am trying to align it to 1MB for the
> erase block size.
>

I looked and looked and I don't see where you're creating a GPT partition
table or indeed doing anything with GPT. You create an MBR table here:

> gpart create -s mbr ada0
>

And seem to stick with it through the rest of your example. If you adjust
this to:

gpart create -s gpt ada0

You may get better results, because MBR is indeed going to saddle you with
cylinder boundaries using some inscrutable probably-fictional geometry.

I think you'd want something like

gpart add -t freebsd-boot -b 34 -s 128 ada0
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
gpart add -b 2048 -s 51G -l zroot -t freebsd-zfs ada0
gpart add -s 8G -t freebsd-swap ada0

But that might need some tweaking. Your zpool will then use the "zroot"
partition / ada0p2.

Hope that is helpful.
From owner-freebsd-fs@FreeBSD.ORG Sun Mar 10 06:30:50 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id DBC313FE for ; Sun, 10 Mar 2013 06:30:50 +0000 (UTC) (envelope-from jdavidlists@gmail.com) Received: from mail-ie0-x234.google.com (mail-ie0-x234.google.com [IPv6:2607:f8b0:4001:c03::234]) by mx1.freebsd.org (Postfix) with ESMTP id B2DF4EAC for ; Sun, 10 Mar 2013 06:30:50 +0000 (UTC) Received: by mail-ie0-f180.google.com with SMTP id bn7so3643220ieb.11 for ; Sat, 09 Mar 2013 22:30:50 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=PPfhhwj4qBZdpZY65kPMNJPcfF5HJlBWcYg3l9WdqZ4=; b=O7xsxN2Anse9VIrq66HDC3SKK345gu65+CxW7RkMKkmIPWSAaP28B5r0MHlh1Yitjr AU9OhQa4CRLl6LkggD8E7MMNIobcZK7FyOATtn5fgVZ6bNOggUgCS0eBjRBAmbkeeuhk g0aDmpchYgOmhZHsn4xO4p1ircledgxx8kxH1CzDRom1rOJZ7JUYEtBfkK0QX66T7Vb2 NIGVNJguBZpB8v1i88Za6lJe3Oxo/hTTWDXifCQtBmvILpikUGEkd4QsmcjAFkbfcDuC CD9d+Eny0Kob9kz/CbasY7wXZr3lOzt10gaNCw3eOQXDZ8xm5cozEu2MNISpVzr2fCUg yARg== MIME-Version: 1.0 X-Received: by 10.42.150.131 with SMTP id a3mr5691655icw.8.1362897050476; Sat, 09 Mar 2013 22:30:50 -0800 (PST) Sender: jdavidlists@gmail.com Received: by 10.42.153.133 with HTTP; Sat, 9 Mar 2013 22:30:50 -0800 (PST) In-Reply-To: References: <513C1629.50501@caltel.com> Date: Sun, 10 Mar 2013 01:30:50 -0500 X-Google-Sender-Auth: 7GB0uRqxB3vUfBQTUhypSAnV5hU Message-ID: Subject: Re: Aligning MBR for ZFS boot help From: J David To: Cody Ritts Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , 
X-List-Received-Date: Sun, 10 Mar 2013 06:30:50 -0000

Just to check myself, I ran this real quick on a virstor:

# truncate -s 1G deleteme
# mdconfig -a -t vnode -f deleteme
md0
# gvirstor label -s 62522712k fakessd md0
Resizing virtual size to be a multiple of chunk size.
New virtual size: 61056 MB
Resizing virtual size to fit virstor structures.
New virtual size: 61184 MB (32 new chunks)
# gpart create -s gpt /dev/virstor/fakessd
virstor/fakessd created
# gpart add -t freebsd-boot -b 34 -s 128 /dev/virstor/fakessd
virstor/fakessdp1 added
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 /dev/virstor/fakessd
bootcode written to virstor/fakessd
# gpart add -b 2048 -s 51G -l zroot -t freebsd-zfs /dev/virstor/fakessd
virstor/fakessdp2 added
# gpart add -t freebsd-swap /dev/virstor/fakessd # no -s = use all space left
virstor/fakessdp3 added
# gpart show /dev/virstor/fakessd
=>         34  125304765  virstor/fakessd  GPT  (59G)
           34        128                1  freebsd-boot  (64k)
          162       1886                   - free -  (943k)
         2048  106954752                2  freebsd-zfs  (51G)
    106956800   18347999                3  freebsd-swap  (8.8G)

# zpool create zroot /dev/gpt/zroot
# zpool status
  pool: zroot
 state: ONLINE
  scan: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        zroot        ONLINE       0     0     0
          gpt/zroot  ONLINE       0     0     0

errors: No known data errors

I won't have much luck booting a virstor to test this :) but it sure looks
pretty, so hopefully it will work for you.
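
One way to sanity-check a layout like the one in this message is to test each start sector against the 1 MiB boundary. The sketch below assumes 512-byte sectors (so 1 MiB = 2048 sectors). The helper name is_aligned is hypothetical and is not part of gpart; the start sectors are copied from the gpart show output in this message.

```sh
#!/bin/sh
# Sketch only: is_aligned is a hypothetical helper, not a gpart feature.
# Assumes 512-byte sectors, so 1 MiB = 2048 sectors.
SECTORS_PER_MIB=2048

# Succeeds (exit status 0) when the start sector lies on a 1 MiB boundary.
is_aligned() {
    [ $(( $1 % SECTORS_PER_MIB )) -eq 0 ]
}

# Start sectors of freebsd-boot, freebsd-zfs and freebsd-swap.
for start in 34 2048 106956800; do
    if is_aligned "$start"; then
        echo "$start: aligned to 1 MiB"
    else
        echo "$start: not aligned to 1 MiB"
    fi
done
```

Under these assumptions the freebsd-zfs and freebsd-swap partitions both start on 1 MiB boundaries, while the small freebsd-boot partition at sector 34 does not.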
From owner-freebsd-fs@FreeBSD.ORG Sun Mar 10 08:22:09 2013 Return-Path: Delivered-To: fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 0CF582CB; Sun, 10 Mar 2013 08:22:09 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail02.syd.optusnet.com.au (mail02.syd.optusnet.com.au [211.29.132.183]) by mx1.freebsd.org (Postfix) with ESMTP id 640A21F3; Sun, 10 Mar 2013 08:22:07 +0000 (UTC) Received: from c211-30-173-106.carlnfd1.nsw.optusnet.com.au (c211-30-173-106.carlnfd1.nsw.optusnet.com.au [211.30.173.106]) by mail02.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id r2A8LvSb028477 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sun, 10 Mar 2013 19:21:59 +1100 Date: Sun, 10 Mar 2013 19:21:57 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: "Kenneth D. Merry" Subject: Re: patches to add new stat(2) file flags In-Reply-To: <20130308232155.GA47062@nargothrond.kdm.org> Message-ID: <20130310181127.D2309@besplex.bde.org> References: <20130307000533.GA38950@nargothrond.kdm.org> <20130307222553.P981@besplex.bde.org> <20130308232155.GA47062@nargothrond.kdm.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.0 cv=bNdOu4CZ c=1 sm=1 a=n2O7wv11oSwA:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=YOiZBDKP_E4A:10 a=7RUHTw8TR_SxG-0Q0ScA:9 a=CjuIK1q_8ugA:10 a=iA5AuRVOsPQzuK-W:21 a=yF7AlGMdZxl7LVJH:21 a=TEtd8y5WR3g2ypngnwZWYw==:117 Cc: arch@FreeBSD.org, fs@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Mar 2013 08:22:09 -0000 On Fri, 8 Mar 2013, Kenneth D. Merry wrote: > On Fri, Mar 08, 2013 at 00:37:15 +1100, Bruce Evans wrote: >> On Wed, 6 Mar 2013, Kenneth D. 
Merry wrote: >> >>> I have attached diffs against head for some additional stat(2) file flags. >>> >>> The primary purpose of these flags is to improve compatibility with CIFS, >>> both from the client and the server side. >>> ... >> >> I missed looking at the diffs in my previous reply. >> >> % --- //depot/users/kenm/FreeBSD-test3/bin/chflags/chflags.1 2013-03-04 >> 17:51:12.000000000 -0700 >> % +++ /usr/home/kenm/perforce4/kenm/FreeBSD-test3/bin/chflags/chflags.1 >> 2013-03-04 17:51:12.000000000 -0700 >> % --- /tmp/tmp.49594.86 2013-03-06 16:42:43.000000000 -0700 >> % +++ /usr/home/kenm/perforce4/kenm/FreeBSD-test3/bin/chflags/chflags.1 >> 2013-03-06 14:47:25.987128763 -0700 >> % @@ -117,6 +117,16 @@ >> % set the user immutable flag (owner or super-user only) >> % .It Cm uunlnk , uunlink >> % set the user undeletable flag (owner or super-user only) >> % +.It Cm system , usystem >> % +set the Windows system flag (owner or super-user only) >> >> This begins unsorting of the list. > > Fixed. > >> It's not just a Windows flag, since it also works in DOS. > > Fixed. Thanks. Hopefully all the simple bugs are fixed now. >> "Owner or" is too strict for msdosfs, since files can only have a >> single owner so it is controlling access using groups is needed. I >> use owner root and group msdosfs for msdosfs mounts. This works for >> normal operations like open/read/write, but fails for most attributes >> including file flags. msdosfs doesn't support many attributes but >> this change is supposed to add support for 3 new file flags so it would >> be good if it didn't restrict the support to root. > > I wasn't trying to change the existing security model for msdosfs, but if > you've got a suggested patch to fix it I can add that in. I can't think of anything better than making group write permission enough for attributes. msdosfs also has some style bugs in this area. 
It uses VOP_ACCESS() with VADMIN for the non-VA_UTIMES_NULL case of
utimes(), but for all other attributes it hard-codes a direct uid check
followed by a priv_check_cred() with PRIV_VFS_ADMIN. VADMIN requires even
more than user write permission for POSIX file systems and using it
unchanged for all the attributes would be even more restrictive unless we
changed it, but it would be easier to make it uniformly less restrictive
for msdosfs by using it consistently.

Oops, that was in the old version of ffs. ffs now has related
complications and unnecessary style bugs (verboseness and misformatting)
to support ACLs. It now uses VOP_ACCESSX() with VWRITE_ATTRIBUTES for
utimes(), and VOP_ACCESSX() with other VFOO for all attributes except
flags. It still uses VOP_ACCESS() with VADMIN for flags.

>> ...
>> % .It Dv SF_ARCHIVED
>> ...
>> % +Filesystems in FreeBSD may or may not have special handling for this flag.
>> % +For instance, ZFS tracks changes to files and will clear this bit when a
>> % +file is updated.
>> % +UFS only stores the flag, and relies on the application to change it when
>> % +needed.
>>
>> I think that is useless, since changing it is needed whenever the file
>> changes, and applications can do that (short of running as daemons and
>> watching for changes).
>
> Do you mean applications can't do that or can?

Oops, can't.

It is still hard for users to know what their file system supports. Even
programmers don't know that it is backwards :-).
>> % --- //depot/users/kenm/FreeBSD-test3/sys/fs/msdosfs/msdosfs_vnops.c >> 2013-03-04 17:51:12.000000000 -0700 >> % +++ >> /usr/home/kenm/perforce4/kenm/FreeBSD-test3/sys/fs/msdosfs/msdosfs_vnops.c >> 2013-03-04 17:51:12.000000000 -0700 >> % --- /tmp/tmp.49594.370 2013-03-06 16:42:43.000000000 -0700 >> % +++ >> /usr/home/kenm/perforce4/kenm/FreeBSD-test3/sys/fs/msdosfs/msdosfs_vnops.c >> 2013-03-06 14:49:47.179130318 -0700 >> % @@ -345,8 +345,17 @@ >> % vap->va_birthtime.tv_nsec = 0; >> % } >> % vap->va_flags = 0; >> % + /* >> % + * The DOS Archive attribute means that a file needs to be >> % + * archived. The BSD SF_ARCHIVED attribute means that a file has >> % + * been archived. Thus the inversion here. >> % + */ >> >> No need to document it again. It goes without saying that ARCHIVE >> != ARCHIVED. > > I disagree. It wasn't immediately obvious to me that SF_ARCHIVED was > generally used as the inverse of the DOS Archived bit until I started > digging into this. If this helps anyone figure that out more quickly, it's > useful. The surprising thing is that it is backwards in FreeBSD and not really supported except in msdosfs. Now several file systems have the comment about it being inverted, but man pages still don't. >> % @@ -420,12 +429,21 @@ >> % if (error) >> % return (error); >> % } >> >> The permissions check before this is delicate and was broken and is >> more broken now. It is still short-circuit to handle setting the >> single flag that used to be supported, and is slightly broken for that: >> - unprivileged user asks to set ARCHIVE by passing !SF_ARCHIVED. We >> allow that, although this may toggle the flag and normal semantics >> for SF flags is to not allow toggling. >> - unprivileged user asks to clear ARCHIVE by passing SF_ARCHIVED. We >> don't allow that. But we should allow preserving ARCHIVE if it is >> already clear. >> The bug wasn't very important when only 1 flag was supported. 
>> Now it prevents unprivileged users managing the new UF flags if ARCHIVE
>> is clear. Fortunately, this is the unusual case. Anyway, unprivileged
>> users can set ARCHIVE by doing some other operation. Even the chflags()
>> operation should set ARCHIVE and thus allow further chflags()'s that now
>> keep ARCHIVE set. Except it is very confusing if a chflags() asks for
>> ARCHIVE to be clear. This request might be just to try to preserve
>> the current setting and not want it if other things are changed, or
>> it might be to purposely clear it. Changing it from set to clear should
>> still be privileged.
>
> I changed it to allow setting or clearing SF_ARCHIVED. Now I can set or
> clear the flag as non-root:

Actually, it seems OK, since there are no old or new SF_ immutable
flags. Some of the actions are broken in the old and new code for
directories -- see below.

>> See the more complicated permissions check in ffs. It would be safest
>> to duplicate most of it, to get different permissions checking for the
>> SF and UF flags. Then decide if we want to keep allowing setting
>> ARCHIVE without privilege.
>
> I think we should allow getting and setting SF_ARCHIVED without special
> privileges. Given how it is generally used, I don't think it should be
> restricted to the super-user.

I don't really like that since changing the flags is mainly needed for
the fairly privileged operation of managing other OS's file systems.
However, since we're mapping the SYSTEM flag to a UF_ flag, the SYSTEM
flag will require less privilege than the ARCHIVE flag. This is
backwards, so we might as well require less privilege for ARCHIVE too.

I think we, that is, you should use a new UF_ARCHIVE flag with the
correct sense.

> Can you provide some code demonstrating how the permissions code should
> be changed in msdosfs? I don't know that much about that sort of thing,
> so I'll probably spend an inordinate amount of time stumbling
> through it.

Now I think only cleanups are needed.
>> %             return EOPNOTSUPP;
>> %     if (vap->va_flags & SF_ARCHIVED)
>> %             dep->de_Attributes &= ~ATTR_ARCHIVE;
>> %     else if (!(dep->de_Attributes & ATTR_DIRECTORY))
>> %             dep->de_Attributes |= ATTR_ARCHIVE;
>>
>> The comment before this says that we ignore attempts to set ATTR_ARCHIVE
>> for directories. However, it is out of date. WinXP allows setting it
>> and all the new flags for directories, and so do we.
>
> Do you mean we allow setting it in UFS, or where? Obviously the code above
> won't set it on a directory.

I meant it here. Actually, the comment matches the code -- I somehow
missed the test in the code. However, the code is wrong. All directories
except the root directory have this and other attributes, but FreeBSD
refuses to set them. More below.

>> The WinXP attrib command (at least on a FAT32 fs) doesn't allow setting
>> or clearing ARCHIVE (even if it is already set or clear) if any of
>> HIDDEN, READONLY or SYSTEM is already set and remains set after the
>> command. Thus the HRS attributes act a bit like immutable flags, but
>> subtly differently. (ffs has the quite different and worse behaviour
>> of allowing chflags() irrespective of immutable flags being set before
>> or after, provided there is enough privilege to change the immutable
>> flags.) Anyway, they should all give some aspects of immutability.
>
> We could do that for msdosfs, but why add more things for the user to trip
> over given how the filesystem is typically used? Most people probably
> use it for USB thumb drives these days. Or perhaps on a dual boot system
> to access their Windows partition.

The small data drives won't have many files with attributes (except
ARCHIVE). For multiple-boot, I think the permissions shouldn't be too
much different than the foreign OS's.
I used not to worry about this and liked deleting WinXP files without asking it, but recently I spent a lot of time recovering a WinXP ntfs partition and changed a bit too much using FreeBSD and Cygwin because I didn't understand the permissions (especially ACLs). ntfs in FreeBSD was less than r/o so it couldn't even back up the permissions (for file flags, it returned the garbage in its internal inode flags without translation...). > *** src/bin/chflags/chflags.1.orig > --- src/bin/chflags/chflags.1 > *************** > *** 101,120 **** > .Bl -tag -offset indent -width ".Cm opaque" > .It Cm arch , archived > set the archived flag (super-user only) > .It Cm opaque > set the opaque flag (owner or super-user only) > - .It Cm nodump > - set the nodump flag (owner or super-user only) > .It Cm sappnd , sappend The opaque flag is UF_ too. > + .It Cm snapshot > + set the snapshot flag (most filesystems do not allow changing this flag) I think none do. It can only be displayed. chflags(1) doesn't display flags, so this shouldn't be here. The problem is that this man page is the only place where the flag names are documented. ls(1) and strtofflags(3) just point to here. strtofflags(3) says that the flag names are documented here, but ls(1) just has an Xref to here. > *** src/lib/libc/sys/chflags.2.orig > --- src/lib/libc/sys/chflags.2 > --- 71,127 ---- > the following values > .Pp > .Bl -tag -width ".Dv SF_IMMUTABLE" -compact -offset indent > ! .It Dv SF_APPEND > The file may only be appended to. > .It Dv SF_ARCHIVED > ! The file has been archived. > ! This flag means the opposite of the Windows and CIFS FILE_ATTRIBUTE_ARCHIVE DOS, Windows and CIFS... > ! attribute. > ! That attribute means that the file should be archived, whereas > ! .Dv SF_ARCHIVED > ! means that the file has been archived. > ! Filesystems in FreeBSD may or may not have special handling for this flag. > ! For instance, ZFS tracks changes to files and will clear this bit when a > ! file is updated. 
Does zfs clear it in other circumstances? WinXP doesn't for msdosfs (or
ntfs?), but FreeBSD clears it when changing some attributes, even for
null changes (these are: times except for atimes, and the HIDDEN
attribute when it is changed by chmod() -- even for null changes --, but
not for the HIDDEN attribute when it is changed (or preserved) by
chflags() in your new code). I want it to be cleared for metadata so that
backup utilities can trust the ARCHIVE flag for metadata changes.

> + .It Dv UF_IMMUTABLE
> + The file may not be changed.
> + Filesystems may use this flag to maintain compatibility with the Windows and
> + CIFS FILE_ATTRIBUTE_READONLY attribute.

So READONLY is only mapped to UF_IMMUTABLE if it gives immutability?

> *** src/sys/fs/msdosfs/msdosfs_vnops.c.orig
> --- src/sys/fs/msdosfs/msdosfs_vnops.c
> ***************
> *** 415,431 ****
>        * set ATTR_ARCHIVE for directories `cp -pr' from a more
>        * sensible filesystem attempts it a lot.
>        */
> !     if (vap->va_flags & SF_SETTABLE) {
>               error = priv_check_cred(cred, PRIV_VFS_SYSFLAGS, 0);
>               if (error)
>                       return (error);
>       }
> !     if (vap->va_flags & ~SF_ARCHIVED)
>               return EOPNOTSUPP;
>       if (vap->va_flags & SF_ARCHIVED)
>               dep->de_Attributes &= ~ATTR_ARCHIVE;
>       else if (!(dep->de_Attributes & ATTR_DIRECTORY))
>               dep->de_Attributes |= ATTR_ARCHIVE;
>       dep->de_flag |= DE_MODIFIED;
>   }
>
> --- 424,448 ----
>        * set ATTR_ARCHIVE for directories `cp -pr' from a more
>        * sensible filesystem attempts it a lot.
>        */
> !     if (vap->va_flags & (SF_SETTABLE & ~(SF_ARCHIVED))) {

Excessive parentheses.

>               error = priv_check_cred(cred, PRIV_VFS_SYSFLAGS, 0);
>               if (error)
>                       return (error);
>       }

VADMIN is still needed, and that is too strict. This is a general problem
and should be fixed separately.

> !
if (vap->va_flags & ~(SF_ARCHIVED | UF_HIDDEN | UF_SYSTEM))
>               return EOPNOTSUPP;
>       if (vap->va_flags & SF_ARCHIVED)
>               dep->de_Attributes &= ~ATTR_ARCHIVE;
>       else if (!(dep->de_Attributes & ATTR_DIRECTORY))
>               dep->de_Attributes |= ATTR_ARCHIVE;
> +     if (vap->va_flags & UF_HIDDEN)
> +             dep->de_Attributes |= ATTR_HIDDEN;
> +     else
> +             dep->de_Attributes &= ~ATTR_HIDDEN;
> +     if (vap->va_flags & UF_SYSTEM)
> +             dep->de_Attributes |= ATTR_SYSTEM;
> +     else
> +             dep->de_Attributes &= ~ATTR_SYSTEM;
>       dep->de_flag |= DE_MODIFIED;
>   }

Technical old and new problems with msdosfs:
- all directories except the root directory support the 3 attributes
  handled above, and READONLY
- the special case for the root directory is because before FAT32, the
  root directory didn't have an entry for itself (and was otherwise
  special). With FAT32, the root directory is not so special, but still
  doesn't have an entry for itself.
- thus the old code in the above is wrong for all directories except
  the root directory
- thus the new code in the above is wrong for the root directory. It
  will make changes to the in-core denode. These can be seen by stat()
  for a while, but go away when the vnode is recycled.
- other code is wrong for directories too. deupdat() refuses to
  convert from the in-core denode to the disk directory entry for
  directories. So even when the above changes values for directories,
  the changes only get synced to the disk accidentally when there is
  a large change (such as for extending the directory), to the directory
  entry.
- being the root directory is best tested for using VV_ROOT. I use the
  following to fix the corresponding bugs in utimes():

        /* Was: silently ignore the non-error or error for all dirs. */
        if (DETOV(dep)->v_vflag & VV_ROOT)
                return (EINVAL);
        /* Otherwise valid. */

  deupdat() needs a similar change to not ignore all directories.
Bruce From owner-freebsd-fs@FreeBSD.ORG Sun Mar 10 15:57:37 2013 Return-Path: Delivered-To: freebsd-fs@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id CFF4C4E6; Sun, 10 Mar 2013 15:57:37 +0000 (UTC) (envelope-from mckusick@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) by mx1.freebsd.org (Postfix) with ESMTP id A10F2350; Sun, 10 Mar 2013 15:57:37 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.6/8.14.6) with ESMTP id r2AFvb9H065897; Sun, 10 Mar 2013 15:57:37 GMT (envelope-from mckusick@freefall.freebsd.org) Received: (from mckusick@localhost) by freefall.freebsd.org (8.14.6/8.14.6/Submit) id r2AFva5E065896; Sun, 10 Mar 2013 15:57:36 GMT (envelope-from mckusick) Date: Sun, 10 Mar 2013 15:57:36 GMT Message-Id: <201303101557.r2AFva5E065896@freefall.freebsd.org> To: kvedulv@kvedulv.de, mckusick@FreeBSD.org, freebsd-fs@FreeBSD.org From: mckusick@FreeBSD.org Subject: Re: kern/162362: [snapshots] [panic] ufs with snapshot(s) panics when getting full X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Mar 2013 15:57:37 -0000 Synopsis: [snapshots] [panic] ufs with snapshot(s) panics when getting full State-Changed-From-To: open->closed State-Changed-By: mckusick State-Changed-When: Sun Mar 10 15:57:04 UTC 2013 State-Changed-Why: Closed at the request of the submitter. 
http://www.freebsd.org/cgi/query-pr.cgi?pr=162362 From owner-freebsd-fs@FreeBSD.ORG Sun Mar 10 16:22:43 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 9CEC199F for ; Sun, 10 Mar 2013 16:22:43 +0000 (UTC) (envelope-from wblock@wonkity.com) Received: from wonkity.com (wonkity.com [67.158.26.137]) by mx1.freebsd.org (Postfix) with ESMTP id 62D90639 for ; Sun, 10 Mar 2013 16:22:43 +0000 (UTC) Received: from wonkity.com (localhost [127.0.0.1]) by wonkity.com (8.14.6/8.14.6) with ESMTP id r2AGMfa6006199; Sun, 10 Mar 2013 10:22:41 -0600 (MDT) (envelope-from wblock@wonkity.com) Received: from localhost (wblock@localhost) by wonkity.com (8.14.6/8.14.6/Submit) with ESMTP id r2AGMfSc006196; Sun, 10 Mar 2013 10:22:41 -0600 (MDT) (envelope-from wblock@wonkity.com) Date: Sun, 10 Mar 2013 10:22:41 -0600 (MDT) From: Warren Block To: Cody Ritts Subject: Re: Aligning MBR for ZFS boot help In-Reply-To: <513C1629.50501@caltel.com> Message-ID: References: <513C1629.50501@caltel.com> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7 (wonkity.com [127.0.0.1]); Sun, 10 Mar 2013 10:22:41 -0600 (MDT) Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Mar 2013 16:22:43 -0000 On Sat, 9 Mar 2013, Cody Ritts wrote: > Poking around on the internet, it looks like gpart is possibly enforcing > geometry boundaries? Not gpart, but the kernel. At present, I don't know of any way to use FreeBSD for creating MBR slices aligned to anything other than 63 blocks. FreeBSD partitions can be aligned inside a slice with an offset. 
Putting ZFS on one of those partitions may be the easiest way to do this. Put the slice at block 2016, then align the first FreeBSD partition inside that slice to 1M and it should land at block 2048. Another option is to create the MBR with aligned slices using another operating system, one that allows deviation from the MBR standard. Ronald Guilmette recently showed an interesting approach of starting the slice at 63M, the least common multiple of 63 and 1M. If the BIOS does not like GPT, check for BIOS updates. And make sure the vendor knows about the problem. From owner-freebsd-fs@FreeBSD.ORG Sun Mar 10 16:35:23 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id A2CC9CAE for ; Sun, 10 Mar 2013 16:35:23 +0000 (UTC) (envelope-from cr@caltel.com) Received: from mail1.caltel.com (mail1.caltel.com [66.102.144.6]) by mx1.freebsd.org (Postfix) with ESMTP id 6C861698 for ; Sun, 10 Mar 2013 16:35:23 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqAEAOa1PFFCZpCq/2dsb2JhbABDxD+BYHSCJgEBBTg1CxELGAkWCAcJAwIBAgE0ERMGAgEBiA+oVpJzjxUWgyoDiHKNY4Vniw6DKhw X-IPAS-Result: AqAEAOa1PFFCZpCq/2dsb2JhbABDxD+BYHSCJgEBBTg1CxELGAkWCAcJAwIBAgE0ERMGAgEBiA+oVpJzjxUWgyoDiHKNY4Vniw6DKhw X-IronPort-AV: E=Sophos;i="4.84,819,1355126400"; d="scan'208";a="16937060" Received: from host-170.a66-102-144.caltel.com (HELO codys-mac.local) ([66.102.144.170]) by smtp.caltel.com with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA; 10 Mar 2013 09:34:33 -0700 Message-ID: <513CB619.9050201@caltel.com> Date: Sun, 10 Mar 2013 09:34:33 -0700 From: Cody Ritts Organization: CalTel User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:17.0) Gecko/20130216 Thunderbird/17.0.3 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Re: Aligning MBR for ZFS boot help References: <513C1629.50501@caltel.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; 
format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Mar 2013 16:35:23 -0000 That is the thing, I am struggling with MBR because the computer just completely refuses to boot from the default GPT. I spent about 2 hours learning that the hard way the other day. Who would have thought a brand new machine would not boot from a GPT disk? I will be testing if it is just FreeBSD's GPT or others as well. Thanks, Cody On 3/9/13 10:12 PM, J David wrote: > > On Sun, Mar 10, 2013 at 12:12 AM, Cody Ritts > wrote: > > I have a new intel atom appliance that will not boot from a GPT > partition table. It came with an SSD, so I am trying to align it to > 1MB for the erase block size. > > > I looked and looked and I don't see where you're creating a GPT > partition table or indeed doing anything with GPT. You create an MBR > table here: > > gpart create -s mbr ada0 > > > And seem to stick with it through the rest of your example. > > If you adjust this to: > > gpart create -s gpt ada0 > > You may get better results, because MBR is indeed going to saddle you > with cylinder boundaries using some inscrutable probably-fictional geometry. > > I think you'd want something like > > gpart add -t freebsd-boot -b 34 -s 128 ada0 > gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0 > gpart add -b 2048 -s 51G -l zroot -t freebsd-zfs ada0 > gpart add -s 8G -t freebsd-swap ada0 > > But that might need some tweaking. Your zpool will then use the "zroot" > partition / ada0p2. > > Hope that is helpful.
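
As a quick sanity check of the GPT layout suggested in the quoted message, here is a minimal sketch of the arithmetic only. It assumes 512-byte sectors, and is_aligned is a made-up helper name, not part of gpart. Nothing here touches a disk.

```shell
#!/bin/sh
# Sketch only: checks the arithmetic of the suggested GPT layout.
# Assumes 512-byte sectors; is_aligned is a hypothetical helper.
SECTOR=512

# is_aligned START_SECTOR ALIGN_BYTES: succeed if the byte offset of
# START_SECTOR is a multiple of ALIGN_BYTES.
is_aligned() {
    [ $(( $1 * SECTOR % $2 )) -eq 0 ]
}

boot_start=34; boot_size=128    # from "gpart add -t freebsd-boot -b 34 -s 128"
zfs_start=2048                  # from "gpart add -b 2048 ... -t freebsd-zfs"
boot_end=$(( boot_start + boot_size - 1 ))

echo "freebsd-boot occupies sectors $boot_start..$boot_end"
if [ "$boot_end" -lt "$zfs_start" ] && is_aligned "$zfs_start" $((1024 * 1024)); then
    echo "freebsd-zfs at sector $zfs_start is 1M-aligned and clear of the boot partition"
fi
```

This only confirms that sector 2048 sits on a 1M boundary and does not overlap the boot partition. Whether a given BIOS will boot from GPT at all is a separate question.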
> From owner-freebsd-fs@FreeBSD.ORG Sun Mar 10 16:53:03 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id DFD15F32 for ; Sun, 10 Mar 2013 16:53:03 +0000 (UTC) (envelope-from cr@caltel.com) Received: from mail1.caltel.com (mail1.caltel.com [66.102.144.6]) by mx1.freebsd.org (Postfix) with ESMTP id C92D472B for ; Sun, 10 Mar 2013 16:53:03 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ap8EABi6PFFCZpCq/2dsb2JhbABDxECBYHSCJgEBBAE4QBELGAkWDwkDAgECAUUTCAEBiAkGu0uPFRaDKgOIco1jhWeLDoMqHA X-IPAS-Result: Ap8EABi6PFFCZpCq/2dsb2JhbABDxECBYHSCJgEBBAE4QBELGAkWDwkDAgECAUUTCAEBiAkGu0uPFRaDKgOIco1jhWeLDoMqHA X-IronPort-AV: E=Sophos;i="4.84,819,1355126400"; d="scan'208";a="16937576" Received: from host-170.a66-102-144.caltel.com (HELO codys-mac.local) ([66.102.144.170]) by smtp.caltel.com with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA; 10 Mar 2013 09:53:03 -0700 Message-ID: <513CBA6E.1090808@caltel.com> Date: Sun, 10 Mar 2013 09:53:02 -0700 From: Cody Ritts Organization: CalTel User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:17.0) Gecko/20130216 Thunderbird/17.0.3 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Re: Aligning MBR for ZFS boot help References: <513C1629.50501@caltel.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Mar 2013 16:53:03 -0000 Offsetting the zfs slice, was one of the the first things I tried, but when I boot the loader tells me: > zfsboot: No ZFS Pools located, can't boot I get the feeling that these need to be next to each other dd if=/boot/zfsboot of=/dev/ada0s1 count=1 dd if=/boot/zfsboot of=/dev/ada0s1a skip=1 seek=1024 Good 
idea on using another fdisk. I will fire up Arch and give it a go. That will also let me test if the system will not boot with any GPT, or if there is something specific to FreeBSD's. Once I isolate it, I will see if I can figure out how to make a bug report to Foxconn. And putting things in perspective, 63M out of 65536M is really nbd. I wish I would have thought of that, so simple. I guess my head is still stuck in 1996 when drives were still measured in MB :) Thanks, Cody On 3/10/13 9:22 AM, Warren Block wrote: > On Sat, 9 Mar 2013, Cody Ritts wrote: > >> Poking around on the internet, it looks like gpart is possibly >> enforcing geometry boundaries? > > Not gpart, but the kernel. At present, I don't know of any way to use > FreeBSD for creating MBR slices aligned to anything other than 63 > blocks. FreeBSD partitions can be aligned inside a slice with an > offset. Putting ZFS on one of those partitions may be the easiest way > to do this. Put the slice at block 2016, then align the first FreeBSD > partition inside that slice to 1M and it should land at block 2048. > > Another option is to create the MBR with aligned slices using another > operating system, one that allows deviation from the MBR standard. > Ronald Guilmette recently showed an interesting approach of starting the > slice at 63M, the least common multiple of 63 and 1M. > > If the BIOS does not like GPT, check for BIOS updates. And make sure > the vendor knows about the problem.
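
The "slice at block 2016, partition lands at block 2048" suggestion in the quoted text can be checked with a little arithmetic. This is a minimal sketch assuming 512-byte sectors and 63-sector tracks. The helper name track_floor is illustrative only.

```shell
#!/bin/sh
# Sketch only: arithmetic behind "slice at 2016, partition lands at 2048".
# Assumes 512-byte sectors and 63-sector tracks; names are illustrative.
TRACK=63
TARGET=2048     # 1M expressed in 512-byte sectors

# track_floor SECTOR: last track-aligned sector at or below SECTOR.
track_floor() {
    echo $(( $1 - $1 % TRACK ))
}

slice_start=$(track_floor "$TARGET")
inner_offset=$(( TARGET - slice_start ))

echo "slice starts at sector $slice_start"
echo "partition offset inside slice: $inner_offset sectors"
echo "absolute partition start: $(( slice_start + inner_offset ))"
```

This checks the numbers only. It says nothing about whether the boot code can find a pool on a partition that starts at a non-zero offset inside the slice.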
> From owner-freebsd-fs@FreeBSD.ORG Sun Mar 10 19:06:26 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id B2AB7D5 for ; Sun, 10 Mar 2013 19:06:26 +0000 (UTC) (envelope-from cr@caltel.com) Received: from mail1.caltel.com (mail1.caltel.com [66.102.144.6]) by mx1.freebsd.org (Postfix) with ESMTP id 550D6B2F for ; Sun, 10 Mar 2013 19:06:26 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqAEAGnYPFFCZpCq/2dsb2JhbABCwWCCYYFgdIImAQEEAThABgsLGAkWDwkDAgECAUUTCAEBiAkGu1KPFYNAA4hyjWOFZ4sOgyoc X-IPAS-Result: AqAEAGnYPFFCZpCq/2dsb2JhbABCwWCCYYFgdIImAQEEAThABgsLGAkWDwkDAgECAUUTCAEBiAkGu1KPFYNAA4hyjWOFZ4sOgyoc X-IronPort-AV: E=Sophos;i="4.84,819,1355126400"; d="scan'208";a="16945837" Received: from host-170.a66-102-144.caltel.com (HELO codys-mac.local) ([66.102.144.170]) by smtp.caltel.com with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA; 10 Mar 2013 12:06:19 -0700 Message-ID: <513CD9AB.5080903@caltel.com> Date: Sun, 10 Mar 2013 12:06:19 -0700 From: Cody Ritts Organization: CalTel User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:17.0) Gecko/20130216 Thunderbird/17.0.3 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Re: Aligning MBR for ZFS boot help References: <513C1629.50501@caltel.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Mar 2013 19:06:26 -0000 So, aligning to 63MB was still tricky. I found your thread, but I could not find step by step how to calculate the offset. So to put a close to this thread (hopefully), here is how I calculated my MBR alignment. 
The MBR seems to force partitions to start on a track boundary.
There are 63 sectors/track.
The rule of thumb for SSD erase blocks is to align to 1MB (2048 sectors).

I can start a partition on every 63rd sector: 63, 126, 189, 252, ... etc.
63 and 2048 have no common factors, so their least common multiple is
63 x 2048 = 129024.

> root@:/root # gpart add -b 129024 -t freebsd ada0
> ada0s1 added
> root@:/root # gpart show ada0
> =>       63  125045361  ada0  MBR  (59G)
>          63     128961        - free -  (63M)
>      129024  124916400     1  freebsd  (59G)

YAY, my partition now starts at track 2048.

1MB for SSDs is a rule of thumb, and most erase blocks are 128K, 256K, or 512K
(so I have read on the internet). Odds are, my SSD has an erase block of
512K or less, so I can choose a smaller offset:

512K = 1024 sectors
1024*63 = 64512

> root@:/root # gpart add -b 64512 -t freebsd ada0
> ada0s1 added
> root@:/root # gpart show ada0
> =>       63  125045361  ada0  MBR  (59G)
>          63      64449        - free -  (31M)
>       64512  124980912     1  freebsd  (59G)

Anyway, that is good enough for me.

In the end, here are my working commands to create an aligned MBR
partition ready for zpool creation and boot.

> gpart create -s mbr ada0
> gpart add -b 64512 -t freebsd ada0
> gpart create -s bsd ada0s1
> gpart add -s 52833M -t freebsd-zfs ada0s1
> gpart add -t freebsd-swap ada0s1
> gpart set -a active -i 1 ada0
> gpart bootcode -b /boot/mbr ada0
> dd if=/boot/zfsboot of=/dev/ada0s1 count=1
> dd if=/boot/zfsboot of=/dev/ada0s1a skip=1 seek=1024

Also, as a related bonus, if you are reading about alignment, here is how to
get 4k blocks for your zpool on an SSD or AF/4K hard drive.
> glabel label zfs /dev/ada0s1a
> gnop create -S 4096 /dev/label/zfs
> zpool create -R /mnt tank /dev/label/zfs.nop
> zpool export tank
> gnop destroy /dev/label/zfs.nop
> zpool import -R /mnt -o cachefile=/tmp/zpool.cache tank
> zdb -U /tmp/zpool.cache | grep ashift
> ashift: 12

2^12 = 4096

Thanks,
Cody

On 3/10/13 9:22 AM, Warren Block wrote:
> On Sat, 9 Mar 2013, Cody Ritts wrote:
>
>> Poking around on the internet, it looks like gpart is possibly
>> enforcing geometry boundaries?
>
> Not gpart, but the kernel. At present, I don't know of any way to use
> FreeBSD for creating MBR slices aligned to anything other than 63
> blocks. FreeBSD partitions can be aligned inside a slice with an
> offset. Putting ZFS on one of those partitions may be the easiest way
> to do this. Put the slice at block 2016, then align the first FreeBSD
> partition inside that slice to 1M and it should land at block 2048.
>
> Another option is to create the MBR with aligned slices using another
> operating system, one that allows deviation from the MBR standard.
> Ronald Guilmette recently showed an interesting approach of starting the
> slice at 63M, the least common multiple of 63 and 1M.
>
> If the BIOS does not like GPT, check for BIOS updates. And make sure
> the vendor knows about the problem.
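
The offset arithmetic in the message above can be reproduced with a short script. This is a minimal sketch, assuming 512-byte sectors and 63-sector tracks. The function names gcd, lcm and ashift are illustrative and are not gpart or zpool features. It computes the smallest start sector that is both track-aligned and erase-block-aligned, and the ashift value that corresponds to a given block size.

```shell
#!/bin/sh
# Sketch only: smallest start sector that is both track-aligned (63 sectors)
# and aligned to an assumed erase-block size, i.e. lcm(track, align).
# Function names are illustrative, not part of gpart or zpool.

# gcd A B: greatest common divisor by Euclid's algorithm.
gcd() {
    a=$1; b=$2
    while [ "$b" -ne 0 ]; do
        t=$((a % b)); a=$b; b=$t
    done
    echo "$a"
}

# lcm A B: least common multiple.
lcm() {
    echo $(( $1 / $(gcd "$1" "$2") * $2 ))
}

# ashift BYTES: base-2 logarithm of a power-of-two block size.
ashift() {
    n=$1; s=0
    while [ "$n" -gt 1 ]; do
        n=$((n / 2)); s=$((s + 1))
    done
    echo "$s"
}

# With 512-byte sectors: 1M = 2048 sectors, 512K = 1024 sectors.
echo "1M-aligned start sector:   $(lcm 63 2048)"
echo "512K-aligned start sector: $(lcm 63 1024)"
echo "ashift for 4096-byte blocks: $(ashift 4096)"
```

Because 63 and any power of two share no factors, the least common multiple is simply their product, which is why the usable start sectors are so far into the disk.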
> From owner-freebsd-fs@FreeBSD.ORG Sun Mar 10 19:34:51 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 3BBF982E for ; Sun, 10 Mar 2013 19:34:51 +0000 (UTC) (envelope-from wblock@wonkity.com) Received: from wonkity.com (wonkity.com [67.158.26.137]) by mx1.freebsd.org (Postfix) with ESMTP id D8282CE1 for ; Sun, 10 Mar 2013 19:34:50 +0000 (UTC) Received: from wonkity.com (localhost [127.0.0.1]) by wonkity.com (8.14.6/8.14.6) with ESMTP id r2AJYmrI007464; Sun, 10 Mar 2013 13:34:48 -0600 (MDT) (envelope-from wblock@wonkity.com) Received: from localhost (wblock@localhost) by wonkity.com (8.14.6/8.14.6/Submit) with ESMTP id r2AJYmoi007461; Sun, 10 Mar 2013 13:34:48 -0600 (MDT) (envelope-from wblock@wonkity.com) Date: Sun, 10 Mar 2013 13:34:48 -0600 (MDT) From: Warren Block To: Cody Ritts Subject: Re: Aligning MBR for ZFS boot help In-Reply-To: <513CD9AB.5080903@caltel.com> Message-ID: References: <513C1629.50501@caltel.com> <513CD9AB.5080903@caltel.com> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7 (wonkity.com [127.0.0.1]); Sun, 10 Mar 2013 13:34:48 -0600 (MDT) Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Mar 2013 19:34:51 -0000 On Sun, 10 Mar 2013, Cody Ritts wrote: > So, aligning to 63MB was still tricky. I found your thread, but I could not > find step by step how to calculate the offset. 
Here is the procedure I had in mind: # gpart create -s mbr da0 da0 created root@lightning# gpart add -t freebsd -b 2016 da0 da0s1 added # gpart show da0 => 63 39070017 da0 MBR (18G) 63 1953 - free - (976k) 2016 39068064 1 freebsd (18G) # gpart create -s bsd da0s1 da0s1 created # gpart add -t freebsd-zfs -a 1m da0s1 da0s1a added root@lightning# gpart show da0s1 => 0 39068064 da0s1 BSD (18G) 0 32 - free - (16k) 32 39067648 1 freebsd-zfs (18G) 39067680 384 - free - (192k) The first slice starts at the last CHS-aligned block before 1M, or 2016. Misaligned, but not a problem because nothing will be reading from that location. The freebsd-zfs partition is created, letting gpart align it to 1M. gpart starts the partition at an offset of 32, making it the 1M-aligned block 2048 of the disk. gpart should also be able to install the bootcode correctly, but I have not tried it for MBR and ZFS. From owner-freebsd-fs@FreeBSD.ORG Sun Mar 10 19:47:54 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id CB72D928 for ; Sun, 10 Mar 2013 19:47:54 +0000 (UTC) (envelope-from cr@caltel.com) Received: from mail1.caltel.com (mail1.caltel.com [66.102.144.6]) by mx1.freebsd.org (Postfix) with ESMTP id 4066CD3B for ; Sun, 10 Mar 2013 19:47:53 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqAEACziPFFCZpCq/2dsb2JhbABCwWCCYYFgdIImAQEEAThABgsLGAkWDwkDAgECAUUTCAEBiAkGu12PFRaDKgOIco1jhWeLDoMqHA X-IPAS-Result: AqAEACziPFFCZpCq/2dsb2JhbABCwWCCYYFgdIImAQEEAThABgsLGAkWDwkDAgECAUUTCAEBiAkGu12PFRaDKgOIco1jhWeLDoMqHA X-IronPort-AV: E=Sophos;i="4.84,819,1355126400"; d="scan'208";a="16947068" Received: from host-170.a66-102-144.caltel.com (HELO codys-mac.local) ([66.102.144.170]) by smtp.caltel.com with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA; 10 Mar 2013 12:47:53 -0700 Message-ID: <513CE369.4030303@caltel.com> Date: Sun, 10 Mar 2013 12:47:53 -0700 From: 
Cody Ritts Organization: CalTel User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:17.0) Gecko/20130216 Thunderbird/17.0.3 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Re: Aligning MBR for ZFS boot help References: <513C1629.50501@caltel.com> <513CD9AB.5080903@caltel.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Mar 2013 19:47:54 -0000 Yeah, just for clarity: > root@lightning# gpart show da0s1 > => 0 39068064 da0s1 BSD (18G) > 0 32 - free - (16k) > 32 39067648 1 freebsd-zfs (18G) > 39067680 384 - free - (192k) and > dd if=/boot/zfsboot of=/dev/da0s1 count=1 > dd if=/boot/zfsboot of=/dev/da0s1a skip=1 seek=1024 will not boot. and will result in: > zfsboot: No ZFS Pools located, can't boot The freebsd-zfs slice cannot have an offset. I tried it several different ways first, since it was the easiest to align, and as soon as I added that offset, the boot strap process would break. Thanks Cody On 3/10/13 12:34 PM, Warren Block wrote: > On Sun, 10 Mar 2013, Cody Ritts wrote: > >> So, aligning to 63MB was still tricky. I found your thread, but I >> could not find step by step how to calculate the offset. > > Here is the procedure I had in mind: > > # gpart create -s mbr da0 > da0 created > root@lightning# gpart add -t freebsd -b 2016 da0 > da0s1 added > # gpart show da0 > => 63 39070017 da0 MBR (18G) > 63 1953 - free - (976k) > 2016 39068064 1 freebsd (18G) > > # gpart create -s bsd da0s1 > da0s1 created > # gpart add -t freebsd-zfs -a 1m da0s1 > da0s1a added > root@lightning# gpart show da0s1 > => 0 39068064 da0s1 BSD (18G) > 0 32 - free - (16k) > 32 39067648 1 freebsd-zfs (18G) > 39067680 384 - free - (192k) > > The first slice starts at the last CHS-aligned block before 1M, or 2016. 
> Misaligned, but not a problem because nothing will be reading from that > location. > > The freebsd-zfs partition is created, letting gpart align it to 1M. > gpart starts the partition at an offset of 32, making it the 1M-aligned > block 2048 of the disk. > > gpart should also be able to install the bootcode correctly, but I have > not tried it for MBR and ZFS. > From owner-freebsd-fs@FreeBSD.ORG Sun Mar 10 19:51:20 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 3B703A88 for ; Sun, 10 Mar 2013 19:51:20 +0000 (UTC) (envelope-from wblock@wonkity.com) Received: from wonkity.com (wonkity.com [67.158.26.137]) by mx1.freebsd.org (Postfix) with ESMTP id E5943D60 for ; Sun, 10 Mar 2013 19:51:19 +0000 (UTC) Received: from wonkity.com (localhost [127.0.0.1]) by wonkity.com (8.14.6/8.14.6) with ESMTP id r2AJpHdv007678; Sun, 10 Mar 2013 13:51:18 -0600 (MDT) (envelope-from wblock@wonkity.com) Received: from localhost (wblock@localhost) by wonkity.com (8.14.6/8.14.6/Submit) with ESMTP id r2AJpHSZ007675; Sun, 10 Mar 2013 13:51:17 -0600 (MDT) (envelope-from wblock@wonkity.com) Date: Sun, 10 Mar 2013 13:51:17 -0600 (MDT) From: Warren Block To: Cody Ritts Subject: Re: Aligning MBR for ZFS boot help In-Reply-To: <513CE369.4030303@caltel.com> Message-ID: References: <513C1629.50501@caltel.com> <513CD9AB.5080903@caltel.com> <513CE369.4030303@caltel.com> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7 (wonkity.com [127.0.0.1]); Sun, 10 Mar 2013 13:51:18 -0600 (MDT) Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Mar 2013 19:51:20 -0000 On Sun, 10 Mar 
2013, Cody Ritts wrote: > Yeah, just for clarity: > >> root@lightning# gpart show da0s1 >> => 0 39068064 da0s1 BSD (18G) >> 0 32 - free - (16k) >> 32 39067648 1 freebsd-zfs (18G) >> 39067680 384 - free - (192k) > > and > >> dd if=/boot/zfsboot of=/dev/da0s1 count=1 >> dd if=/boot/zfsboot of=/dev/da0s1a skip=1 seek=1024 > > will not boot. But is that putting zfsboot in the right place? Try installing zfsboot with gpart. From owner-freebsd-fs@FreeBSD.ORG Sun Mar 10 20:00:01 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id A26D5BE9; Sun, 10 Mar 2013 20:00:01 +0000 (UTC) (envelope-from truckman@FreeBSD.org) Received: from gw.catspoiler.org (gw.catspoiler.org [75.1.14.242]) by mx1.freebsd.org (Postfix) with ESMTP id 6B045DA3; Sun, 10 Mar 2013 20:00:01 +0000 (UTC) Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2]) by gw.catspoiler.org (8.13.3/8.13.3) with ESMTP id r2AJxkIg047829; Sun, 10 Mar 2013 11:59:50 -0800 (PST) (envelope-from truckman@FreeBSD.org) Message-Id: <201303101959.r2AJxkIg047829@gw.catspoiler.org> Date: Sun, 10 Mar 2013 12:59:46 -0700 (PDT) From: Don Lewis Subject: Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS! To: lev@FreeBSD.org In-Reply-To: <1809201254.20130309160817@serebryakov.spb.ru> MIME-Version: 1.0 Content-Type: TEXT/plain; charset=iso-8859-5 Content-Transfer-Encoding: 8BIT Cc: freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Mar 2013 20:00:01 -0000 On 9 Mar, Lev Serebryakov wrote: > Hello, Don. 
> You wrote 9 March 2013, 7:03:52: > > >>> But anyway, torrent client is bad benchmark if we start to speak >>> about some real experiments to decide what could be improved in >>> FFS/GEOM stack, as it is not very repeatable. > DL> I seem to recall that you mentioned that the raid5 geom layer is doing > DL> a lot of caching, presumably to coalesce writes. If this causes the > DL> responses to writes to be delayed too much, then the geom layer could > DL> end up starved for writes because the vfs.hirunningspace limit will be > DL> reached. If this happens, you'll see threads waiting on wdrain. You > DL> could also monitor vfs.runningbufspace to see how close it is getting to > DL> the limit. If this is the problem, you might want to try cranking up > Strangely enough, vfs.runningbufspace is always zero, even under > load. That's very odd ... > My geom_raid5 is configured to delay writes up to 15 seconds... > > DL> Something else to look at is what problems might the delayed write > DL> completion notifications from the drives cause in the raid5 layer > DL> itself. Could that be preventing the raid5 layer from sending other I/O > DL> commands to the drives? Between the time a write command has been sent > Nope. It should not. I'm not sure for 100%, as I picked up these > sources from original author and sources are rather cryptic, but I > could not see any throttling in it. > DL> to a drive and the drive reports the completion of the write, what > DL> happens if something wants to touch that buffer? > > DL> What size writes does the application typically do? What is the UFS > 64K writes, 32K blocksize, 128K stripe size... Now I'm analyzing > traces from this device to understand exact write patterns. It would be interesting to see what percentage of the writes are full stripe versus a partial stripe to see how effective the caching is. The partial stripe writes probably have to read the parity in order to update it.
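
One rough way to estimate the full-stripe percentage from a trace is sketched below. This is an illustration under assumptions, not the geom_raid5 logic: it treats "stripe" as the full data stripe in bytes (the 128K figure in the thread may instead be a per-disk unit), and the four "offset length" pairs are invented sample data, not real trace output. is_full_stripe is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: classify writes as full-stripe or partial-stripe.
# STRIPE is assumed to be the full data-stripe size in bytes.
STRIPE=131072

# is_full_stripe OFFSET LENGTH STRIPE: succeed if the write starts on a
# stripe boundary and covers a whole number of stripes.
is_full_stripe() {
    [ $(( $1 % $3 )) -eq 0 ] && [ "$2" -gt 0 ] && [ $(( $2 % $3 )) -eq 0 ]
}

full=0; total=0
# Invented sample data: one "offset length" pair per line, in bytes.
while read -r off len; do
    total=$((total + 1))
    if is_full_stripe "$off" "$len" "$STRIPE"; then
        full=$((full + 1))
    fi
done <<EOF
0 65536
65536 65536
131072 131072
262144 262144
EOF

echo "full-stripe writes: $full of $total ($((100 * full / total))%)"
```

Under this definition a lone 64K write can never be a full-stripe write on a 128K stripe, so any full-stripe writes seen at the drives would have to come from coalescing in the cache.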
It would also be interesting to monitor the number of commands each drive is handling. In the ahci.c, it looks like it would be ch->numrslots assuming that you aren't using a port multiplier. > DL> blocksize? What is the raid5 stripe size? With this access pattern, > DL> you may get poor results if the stripe size is much greater than the > DL> block and write sizes. > From owner-freebsd-fs@FreeBSD.ORG Sun Mar 10 20:37:05 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 693822C5 for ; Sun, 10 Mar 2013 20:37:05 +0000 (UTC) (envelope-from cr@caltel.com) Received: from mail1.caltel.com (mail1.caltel.com [66.102.144.6]) by mx1.freebsd.org (Postfix) with ESMTP id 3620EEC0 for ; Sun, 10 Mar 2013 20:37:04 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ap8EACPuPFFCZpCq/2dsb2JhbABCxESBYXSCJgEBBThAEQsYCRYPCQMCAQIBRRMIAQGIDwy7TgSPFRaDKgOIco1jhWeLDoMqHA X-IPAS-Result: Ap8EACPuPFFCZpCq/2dsb2JhbABCxESBYXSCJgEBBThAEQsYCRYPCQMCAQIBRRMIAQGIDwy7TgSPFRaDKgOIco1jhWeLDoMqHA X-IronPort-AV: E=Sophos;i="4.84,819,1355126400"; d="scan'208";a="16947872" Received: from host-170.a66-102-144.caltel.com (HELO codys-mac.local) ([66.102.144.170]) by smtp.caltel.com with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA; 10 Mar 2013 13:36:55 -0700 Message-ID: <513CEEE7.8090400@caltel.com> Date: Sun, 10 Mar 2013 13:36:55 -0700 From: Cody Ritts Organization: CalTel User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:17.0) Gecko/20130216 Thunderbird/17.0.3 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Re: Aligning MBR for ZFS boot help References: <513C1629.50501@caltel.com> <513CD9AB.5080903@caltel.com> <513CE369.4030303@caltel.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , 
List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Mar 2013 20:37:05 -0000 I have never seen ANY reference to installing the /boot/zfsboot with gpart. And if you look at how that ZFS boot code is installed, it is not like any of the other boot codes that gpart does install. I don't even know how you would construct the syntax. The bootcode arguments seem to be GPT and MBR specific, and mutually exclusive. https://wiki.freebsd.org/RootOnZFS/ZFSBootPartition#A_Installing_FreeBSD_to_the_ZFS_filesystem I know that if I leave either of those dd commands out when installing, the system will not boot. If I leave out the ada0s1 boot code, the system just hangs after mbr is run. If I leave out the ada0s1a boot code, I get an error. (I don't remember what.) Ultimately, I am aligned and running so I am happy enough. 32MB is an acceptable loss. I am burnt out on struggling with the bootloader. I have been doing it in various ways for DAYS now trying to just figure out ZFS boot at all. There are a lot of broken (or old?) ZFS boot howtos that made the process more difficult than it should be for me because I wasn't lucky enough to stumble onto the "correct" articles right off the bat. I do everything the hard way :) Thanks Cody On 3/10/13 12:51 PM, Warren Block wrote: > On Sun, 10 Mar 2013, Cody Ritts wrote: > >> Yeah, just for clarity: >> >>> root@lightning# gpart show da0s1 >>> => 0 39068064 da0s1 BSD (18G) >>> 0 32 - free - (16k) >>> 32 39067648 1 freebsd-zfs (18G) >>> 39067680 384 - free - (192k) >> >> and >> >>> dd if=/boot/zfsboot of=/dev/da0s1 count=1 >>> dd if=/boot/zfsboot of=/dev/da0s1a skip=1 seek=1024 >> >> will not boot. > > But is that putting zfsboot in the right place? Try installing zfsboot > with gpart.
> From owner-freebsd-fs@FreeBSD.ORG Sun Mar 10 20:43:06 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 44ADE624 for ; Sun, 10 Mar 2013 20:43:06 +0000 (UTC) (envelope-from cross+freebsd@distal.com) Received: from mail.distal.com (mail.distal.com [IPv6:2001:470:e24c:200::ae25]) by mx1.freebsd.org (Postfix) with ESMTP id 1FC5FF0A for ; Sun, 10 Mar 2013 20:43:05 +0000 (UTC) Received: from magrathea.distal.com (magrathea.distal.com [206.138.151.12]) (authenticated bits=0) by mail.distal.com (8.14.3/8.14.3) with ESMTP id r2AKfmNE029825 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NO); Sun, 10 Mar 2013 16:41:49 -0400 (EDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Subject: Re: Aligning MBR for ZFS boot help From: Chris Ross In-Reply-To: <513CEEE7.8090400@caltel.com> Date: Sun, 10 Mar 2013 16:41:48 -0400 Content-Transfer-Encoding: quoted-printable Message-Id: <6A16D732-01B5-43A8-B676-65B9B35C1FDA@distal.com> References: <513C1629.50501@caltel.com> <513CD9AB.5080903@caltel.com> <513CE369.4030303@caltel.com> <513CEEE7.8090400@caltel.com> To: Cody Ritts X-Mailer: Apple Mail (2.1499) Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Mar 2013 20:43:06 -0000 On Mar 10, 2013, at 16:36 , Cody Ritts wrote: > I have never seen ANY reference to installing the /boot/zfsboot with gpart. And if you look at how that ZFS boot code is installed, it is not like any of the other boot codes that gpart does install. I don't even know how you would construct the syntax. The bootcode arguments seem to be GPT and MBR specific, and mutually exclusive.
This is a sparc64 reference, but it does install /boot/zfsboot with gpart. FYI. http://lists.freebsd.org/pipermail/freebsd-sparc64/2012-July/008489.html - Chris From owner-freebsd-fs@FreeBSD.ORG Sun Mar 10 21:40:09 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 07D9E5EA for ; Sun, 10 Mar 2013 21:40:09 +0000 (UTC) (envelope-from freebsd@penx.com) Received: from btw.pki2.com (btw.pki2.com [IPv6:2001:470:a:6fd::2]) by mx1.freebsd.org (Postfix) with ESMTP id C514B182 for ; Sun, 10 Mar 2013 21:40:08 +0000 (UTC) Received: from [127.0.0.1] (localhost [127.0.0.1]) by btw.pki2.com (8.14.6/8.14.5) with ESMTP id r2ALduqo078414; Sun, 10 Mar 2013 14:39:56 -0700 (PDT) (envelope-from freebsd@penx.com) Subject: Re: Aligning MBR for ZFS boot help From: Dennis Glatting To: Warren Block In-Reply-To: References: <513C1629.50501@caltel.com> <513CD9AB.5080903@caltel.com> <513CE369.4030303@caltel.com> Content-Type: text/plain; charset="us-ascii" Date: Sun, 10 Mar 2013 14:39:55 -0700 Message-ID: <1362951595.99445.2.camel@btw.pki2.com> Mime-Version: 1.0 X-Mailer: Evolution 2.32.1 FreeBSD GNOME Team Port Content-Transfer-Encoding: 7bit X-yoursite-MailScanner-Information: Dennis Glatting X-yoursite-MailScanner-ID: r2ALduqo078414 X-yoursite-MailScanner: Found to be clean X-MailScanner-From: freebsd@penx.com Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Mar 2013 21:40:09 -0000 Sorry for the stupid question but is this issue (and issues) and procedures written up somewhere?
On Sun, 2013-03-10 at 13:51 -0600, Warren Block wrote: > On Sun, 10 Mar 2013, Cody Ritts wrote: > > > Yeah, just for clarity: > > > >> root@lightning# gpart show da0s1 > >> => 0 39068064 da0s1 BSD (18G) > >> 0 32 - free - (16k) > >> 32 39067648 1 freebsd-zfs (18G) > >> 39067680 384 - free - (192k) > > > > and > > > >> dd if=/boot/zfsboot of=/dev/da0s1 count=1 > >> dd if=/boot/zfsboot of=/dev/da0s1a skip=1 seek=1024 > > > > will not boot. > > But is that putting zfsboot in the right place? Try installing zfsboot > with gpart. > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" From owner-freebsd-fs@FreeBSD.ORG Sun Mar 10 22:57:36 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 51B4A24F for ; Sun, 10 Mar 2013 22:57:36 +0000 (UTC) (envelope-from nowakpl@platinum.linux.pl) Received: from platinum.linux.pl (platinum.edu.pl [81.161.192.4]) by mx1.freebsd.org (Postfix) with ESMTP id EE51865A for ; Sun, 10 Mar 2013 22:57:35 +0000 (UTC) Received: by platinum.linux.pl (Postfix, from userid 87) id A3FCA47E1D; Sun, 10 Mar 2013 23:52:22 +0100 (CET) X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on platinum.linux.pl X-Spam-Level: X-Spam-Status: No, score=-1.3 required=3.0 tests=ALL_TRUSTED,AWL autolearn=disabled version=3.3.2 Received: from [10.255.1.2] (unknown [83.151.38.73]) by platinum.linux.pl (Postfix) with ESMTPA id 09DEB47E11 for ; Sun, 10 Mar 2013 23:52:22 +0100 (CET) Message-ID: <513D0E90.5090105@platinum.linux.pl> Date: Sun, 10 Mar 2013 23:52:00 +0100 From: Adam Nowacki User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130215 Thunderbird/17.0.3 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Re: Aligning MBR for ZFS boot help References: <513C1629.50501@caltel.com> 
In-Reply-To: <513C1629.50501@caltel.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Mar 2013 22:57:36 -0000 I don't think zfsboot is aware of BSD disklabel (offsets other than 0 won't boot). Is there any reason you are using BSD disklabel and not two partition MBR? I also don't think there is any merit in aligning to 1MiB. Most ZFS IOs will be aligned to sector size (ashift). Unless ZFS pool is created with higher ashift then the 63 sector offset is as good as any. gpart create -s mbr ada0 gpart add -s 52862M -t freebsd ada0 gpart add -s 8G -t freebsd ada0 gpart bootcode -b /boot/mbr ada0 dd if=/boot/zfsboot of=/dev/ada0s1 count=1 dd if=/boot/zfsboot of=/dev/ada0s1 skip=1 seek=1024 zpool create ... /dev/ada0s1 swapon /dev/ada0s2 If you still want that 1MB alignment then you will have to (as explained by others) align the MBR partition. On 2013-03-10 06:12, Cody Ritts wrote: > Hello all, > > I am really struggling to understand what is going on, if anyone could > tell me where I am going wrong, I would greatly appreciate it. > > I have a new intel atom appliance that will not boot from a GPT > partition table. It came with an SSD, so I am trying to align it to 1MB > for the erase block size. 
> > All of these commands are being run from a 9.1-RELEASE-amd64-memstick > > > These commands partition the drive, the system boots just fine: >> gpart create -s mbr ada0 >> gpart add -t freebsd ada0 >> gpart create -s bsd ada0s1 >> gpart add -s 52862M -t freebsd-zfs ada0s1 >> gpart add -s 8G -t freebsd-swap ada0s1 >> gpart set -a active -i 1 ada0 >> gpart bootcode -b /boot/mbr ada0 >> dd if=/boot/zfsboot of=/dev/ada0s1 count=1 >> dd if=/boot/zfsboot of=/dev/ada0s1a skip=1 seek=1024 > > This is the gpart print output of those commands >> => 63 125045361 ada0 MBR (59G) >> 63 125045361 1 freebsd [active] (59G) >> >> => 0 125045361 ada0s1 BSD (59G) >> 0 108261376 1 freebsd-zfs (51G) >> 108261376 16777216 2 freebsd-swap (8.0G) >> 125038592 6769 - free - (3.3M) > > Here is my disk info >> root@:/root # diskinfo -v ada0 >> ada0 >> 512 # sectorsize >> 64023257088 # mediasize in bytes (59G) >> 125045424 # mediasize in sectors >> 0 # stripesize >> 0 # stripeoffset >> 124053 # Cylinders according to firmware. >> 16 # Heads according to firmware. >> 63 # Sectors according to firmware. > > 125045361 + 63 = 125045424 > So gpart is for sure printing sectors. > freebsd-zfs starts at sector 63 > > So, I need that freebsd-zfs slice to start at 1MB > 1MB = 2048s > 2048 - 63 = 1985 > so if I add an offset to my slice: >> gpart add -b 1985 -s 52862M -t freebsd-zfs ada0s1 > > should start me at 2048. >> => 63 125045361 ada0 MBR (59G) >> 63 125045361 1 freebsd [active] (59G) >> => 0 125045361 ada0s1 BSD (59G) >> 0 1985 - free - (992k) >> 1985 108261376 1 freebsd-zfs (51G) > > BUT, when i boot, I get this: >> zfsboot: No ZFS Pools located, can't boot > > I think remember reading that freebsd-zfs had to be the first slice (I > cannot remember where i read that). And it apparently does not think an > offset is funny. 
> > So, that leaves me with trying to adjust my MBR partition, so I start > over and run: >> gpart add -b 1985 -t freebsd ada0 > > but that gives me: >> => 63 125045361 ada0 MBR (59G) >> 63 1953 - free - (976k) >> 2016 125043408 1 freebsd (59G) > > HHHMMMMM???? well, 2016 - 1953 = 63 coincidence? i doubt it, but I > dont get it. > > Poking around on the internet, it looks like gpart is possibly enforcing > geometry boundaries? so I do the following: > >> sysctl kern.geom.part.check_integrity=0 >> root@:/root # gpart add -a 1m -t freebsd ada0 >> ada0s1 added >> root@:/root # gpart show >> => 63 125045361 ada0 MBR (59G) >> 63 2016 - free - (1M) >> 2079 125042652 1 freebsd (59G) >> 125044731 693 - free - (346k) > > Obviously still didnt work. > > > I try a 10MB offset. > 10MB = 20480s > 20480-63 = 20417s >> gpart add -b 20417 -t freebsd ada0 >> => 63 125045361 ada0 MBR (59G) >> 63 20412 - free - (10M) >> 20475 125024949 1 freebsd (59G) > > It is still just a few sectors off. So what if i let gpart > automatically align it. > >> gpart add -a 1m -t freebsd ada0 > >> => 63 125045361 ada0 MBR (59G) >> 63 2016 - free - (1M) >> 2079 125042652 1 freebsd (59G) >> 125044731 693 - free - (346k) > > > And 2079 is still != 2048. > > I have tried adjusting those numbers one by one, and it just hops around > the number I am looking for. I have tried adding partitions in-front of > it, setting the alignment to 1s, and adjusting the size. I cannot get > it to land on 2048. > > It does boot with the padding in the MBR table, but I don't think it is > aligned. Maybe it is aligned, and I just dont know any better. > > I am at a loss. > > Any suggestions would be greatly appreciated. 
> > Thanks, > > Cody > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" From owner-freebsd-fs@FreeBSD.ORG Sun Mar 10 23:59:18 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id E40C3D6E for ; Sun, 10 Mar 2013 23:59:18 +0000 (UTC) (envelope-from wblock@wonkity.com) Received: from wonkity.com (wonkity.com [67.158.26.137]) by mx1.freebsd.org (Postfix) with ESMTP id 9B306827 for ; Sun, 10 Mar 2013 23:59:18 +0000 (UTC) Received: from wonkity.com (localhost [127.0.0.1]) by wonkity.com (8.14.6/8.14.6) with ESMTP id r2ANxHxN009350; Sun, 10 Mar 2013 17:59:17 -0600 (MDT) (envelope-from wblock@wonkity.com) Received: from localhost (wblock@localhost) by wonkity.com (8.14.6/8.14.6/Submit) with ESMTP id r2ANxHob009347; Sun, 10 Mar 2013 17:59:17 -0600 (MDT) (envelope-from wblock@wonkity.com) Date: Sun, 10 Mar 2013 17:59:17 -0600 (MDT) From: Warren Block To: Adam Nowacki Subject: Re: Aligning MBR for ZFS boot help In-Reply-To: <513D0E90.5090105@platinum.linux.pl> Message-ID: References: <513C1629.50501@caltel.com> <513D0E90.5090105@platinum.linux.pl> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7 (wonkity.com [127.0.0.1]); Sun, 10 Mar 2013 17:59:17 -0600 (MDT) Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Mar 2013 23:59:19 -0000 On Sun, 10 Mar 2013, Adam Nowacki wrote: > I don't think zfsboot is aware of BSD disklabel (offsets other than 0 won't > boot). 
Is there any reason you are using BSD disklabel and not two partition > MBR? MBR slices created on FreeBSD are forced to CHS alignment, pretty much always misaligned for 4K-block hard drives or SSDs. > I also don't think there is any merit in aligning to 1MiB. Most ZFS IOs will > be aligned to sector size (ashift). Unless ZFS pool is created with higher > ashift then the 63 sector offset is as good as any. If the drive has 4K sectors, that will be misaligned, potentially cutting speeds drastically even if ashift is 9. From owner-freebsd-fs@FreeBSD.ORG Mon Mar 11 00:39:33 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 083632B1 for ; Mon, 11 Mar 2013 00:39:33 +0000 (UTC) (envelope-from wblock@wonkity.com) Received: from wonkity.com (wonkity.com [67.158.26.137]) by mx1.freebsd.org (Postfix) with ESMTP id AFC05974 for ; Mon, 11 Mar 2013 00:39:32 +0000 (UTC) Received: from wonkity.com (localhost [127.0.0.1]) by wonkity.com (8.14.6/8.14.6) with ESMTP id r2B0dU2X009599; Sun, 10 Mar 2013 18:39:30 -0600 (MDT) (envelope-from wblock@wonkity.com) Received: from localhost (wblock@localhost) by wonkity.com (8.14.6/8.14.6/Submit) with ESMTP id r2B0dUKW009596; Sun, 10 Mar 2013 18:39:30 -0600 (MDT) (envelope-from wblock@wonkity.com) Date: Sun, 10 Mar 2013 18:39:30 -0600 (MDT) From: Warren Block To: Dennis Glatting Subject: Re: Aligning MBR for ZFS boot help In-Reply-To: <1362951595.99445.2.camel@btw.pki2.com> Message-ID: References: <513C1629.50501@caltel.com> <513CD9AB.5080903@caltel.com> <513CE369.4030303@caltel.com> <1362951595.99445.2.camel@btw.pki2.com> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7 (wonkity.com [127.0.0.1]); Sun, 10 Mar 2013 18:39:30 -0600 (MDT) Cc: freebsd-fs@freebsd.org X-BeenThere: 
freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 11 Mar 2013 00:39:33 -0000 On Sun, 10 Mar 2013, Dennis Glatting wrote: > Sorry for the stupid question but is this issue (and issues) and > procedures written up somewhere? The wiki shows using GPT: https://wiki.freebsd.org/RootOnZFS/GPTZFSBoot/9.0-RELEASE The issue with MBR is that FreeBSD strictly follows the standard of alignment to CHS values. However, for the last couple of decades, drives have used variable geometry to fit more data on the outside tracks, so CHS values don't apply any more. Combine this with the new need to align MBR slices to particular values of 4K or 1M, and FreeBSD has a problem. Solution: use GPT when possible. But some systems won't boot from GPT. If MBR partitioning is required, and alignment is needed, use some other operating system to create the MBR, and don't try to edit the slices on FreeBSD. I don't know if anyone has documented all this in one place. If FreeBSD had a way to turn off the strict enforcement of CHS values for MBR, it would make that unnecessary. 
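
A note on the arithmetic: the start sectors reported earlier in this thread (2016, 2079 and 20475) are all multiples of the 63 sectors per track that diskinfo shows for ada0, which fits the CHS explanation above. The Python sketch below reproduces those numbers. It assumes that gpart simply rounds a requested MBR slice start up to the next track boundary; that model is inferred from the outputs posted in this thread, not taken from the gpart source, and the function names are made up for illustration.

```python
import math

SECTOR_SIZE = 512           # bytes per sector, from "diskinfo -v ada0"
SECTORS_PER_TRACK = 63      # "Sectors according to firmware"
MIB_IN_SECTORS = 1024 * 1024 // SECTOR_SIZE   # 1 MiB = 2048 sectors


def round_up(value, multiple):
    """Round value up to the next multiple (unchanged if already a multiple)."""
    return -(-value // multiple) * multiple


def chs_start(requested_start):
    """Assumed model: the slice start is rounded up to a track boundary."""
    return round_up(requested_start, SECTORS_PER_TRACK)


def is_aligned(start_sector, alignment_bytes):
    """True if the byte offset of start_sector is a multiple of alignment_bytes."""
    return (start_sector * SECTOR_SIZE) % alignment_bytes == 0


def first_common_start(alignment_sectors):
    """Smallest nonzero sector that is both a track boundary and aligned."""
    return (SECTORS_PER_TRACK * alignment_sectors
            // math.gcd(SECTORS_PER_TRACK, alignment_sectors))


if __name__ == "__main__":
    # Requested starts taken from the thread: -b 1985, -a 1m (2048), -b 20417.
    for requested in (1985, 2048, 20417):
        start = chs_start(requested)
        print(requested, "->", start,
              "4K-aligned:", is_aligned(start, 4096),
              "1M-aligned:", is_aligned(start, 1024 * 1024))
    print("first track boundary on a 1 MiB multiple:",
          first_common_start(MIB_IN_SECTORS))
```

Under that assumption, a slice start can only be both CHS-aligned and 1 MiB-aligned at a multiple of 63 * 2048 sectors, since 63 and 2048 share no common factor. Whether a slice placed there actually boots on this hardware has not been tested in this thread.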
From owner-freebsd-fs@FreeBSD.ORG Mon Mar 11 04:19:40 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id E08A121F for ; Mon, 11 Mar 2013 04:19:40 +0000 (UTC) (envelope-from jdavidlists@gmail.com) Received: from mail-ie0-x22d.google.com (mail-ie0-x22d.google.com [IPv6:2607:f8b0:4001:c03::22d]) by mx1.freebsd.org (Postfix) with ESMTP id B496D309 for ; Mon, 11 Mar 2013 04:19:40 +0000 (UTC) Received: by mail-ie0-f173.google.com with SMTP id 9so4251761iec.32 for ; Sun, 10 Mar 2013 21:19:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=Dxu8GRKfhgaKjyOXqfsUVmWp9yzwDVq/TOrQjV66Nds=; b=wSI2t8Pp84JbALL1tizqWjHdOAMxgbvcZcR5TC2naSL8V9eEs4YBiai0EZf12cNfwc v1d1+EjpDBxJ7VMXuZmJfXxCtH78i95JdSBw7Ib0hUMVkGVZ4XYbIRw05B2yytDUjRfe eakYl+vWmJBtBaU/AX5nUFG6zATJXUk1CtR6ZMcvv+WTvit+Kz6NHmUkGwMCy0/pRWDJ BzE+4J583fEiZNWsYDvyQUvJUhoBfkDjP0RnpAuCwbt8TiD27w/GdAXphjoW3ycQVMMY njsnJasc4lmVqv39IPOoNpaQXf7D0kyNja68Rbcn2c/neXWtrR60Rymg6sGqeo9YGZZD 9rEA== MIME-Version: 1.0 X-Received: by 10.42.150.131 with SMTP id a3mr7347785icw.8.1362975580465; Sun, 10 Mar 2013 21:19:40 -0700 (PDT) Sender: jdavidlists@gmail.com Received: by 10.42.153.133 with HTTP; Sun, 10 Mar 2013 21:19:40 -0700 (PDT) In-Reply-To: References: <513C1629.50501@caltel.com> <513CD9AB.5080903@caltel.com> <513CE369.4030303@caltel.com> <1362951595.99445.2.camel@btw.pki2.com> Date: Mon, 11 Mar 2013 00:19:40 -0400 X-Google-Sender-Auth: eGnFIBBt0Gm9FcXDKjkiA8kgoS8 Message-ID: Subject: Re: Aligning MBR for ZFS boot help From: J David To: Warren Block Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: 
Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 11 Mar 2013 04:19:40 -0000 On Sun, Mar 10, 2013 at 8:39 PM, Warren Block wrote: > If FreeBSD had a way to turn off the strict enforcement of CHS values for > MBR, it would make that unnecessary. > The solution to that for a drive of this size *might* be to use fdisk instead of gpart to do the partitioning, and just flat out lie to it about the geometry when it asks you if you want to use something other than what it read. (Which was also a lie anyway.) From owner-freebsd-fs@FreeBSD.ORG Mon Mar 11 07:47:18 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 1961714B; Mon, 11 Mar 2013 07:47:18 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 859EFAFA; Mon, 11 Mar 2013 07:47:17 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.6/8.14.6) with ESMTP id r2B7lD7w092264; Mon, 11 Mar 2013 09:47:13 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.8.0 kib.kiev.ua r2B7lD7w092264 Received: (from kostik@localhost) by tom.home (8.14.6/8.14.6/Submit) id r2B7lDQg092263; Mon, 11 Mar 2013 09:47:13 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Mon, 11 Mar 2013 09:47:13 +0200 From: Konstantin Belousov To: mckusick@FreeBSD.org Subject: Re: kern/162362: [snapshots] [panic] ufs with snapshot(s) panics when getting full Message-ID: <20130311074713.GO3794@kib.kiev.ua> References: <201303101557.r2AFva5E065896@freefall.freebsd.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="8CLqiYUo6qiluZPB" Content-Disposition: inline 
In-Reply-To: <201303101557.r2AFva5E065896@freefall.freebsd.org> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: freebsd-fs@FreeBSD.org, pho@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 11 Mar 2013 07:47:18 -0000 --8CLqiYUo6qiluZPB Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sun, Mar 10, 2013 at 03:57:36PM +0000, mckusick@FreeBSD.org wrote: > Synopsis: [snapshots] [panic] ufs with snapshot(s) panics when getting full > > State-Changed-From-To: open->closed > State-Changed-By: mckusick > State-Changed-When: Sun Mar 10 15:57:04 UTC 2013 > State-Changed-Why: > Closed at the request of the submitter. > > http://www.freebsd.org/cgi/query-pr.cgi?pr=162362 This is a known and still unresolved issue. It is reproducible on HEAD as well.
--8CLqiYUo6qiluZPB Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iQIcBAEBAgAGBQJRPYwBAAoJEJDCuSvBvK1BatgP/3xU6T5cJW0fuNWGJP1HFw7Q JiC/DiqHyCse7PG+FFTlY7ACw/U1oYHu2tvCzE2q4AMwOCzz3Wup503VcVa0OuC3 BSKDDiqF6HnLI3a6XC0bX3R34BTbAzAAxUbap1O1GtEEAhiUsYFOadyYD+WK8sTz ASuFV6VGBCX6tFtP77RO3B67Hb6RW1m9L+BAPKnaMAN85Fdgau71/rnxVEGkRrHl X48a7a9H3OyD3L5RC/a0NKe/gFHUOQ7S64hyIook0yoeakmRD1+Rt4ZgVlsqkw+W c4CN80VcR4pHMamdG4LRAxypAl4cgzImm13xoennKjgp9Tt85ZSAaLI6r1yC5SRS /vS9VVr27SPnyX7dqySpaCYVs36K+o7LOhhog8rPEph8C8L6Z+qWmzZjbnnTrifD oX2zFPWcxpte16mCGfc+GevYNc2SmZ17zOqt9yWYbbPoMCA7adfs+nMtLv93eGVX /0zpyCgQikRk/tyLwHySSNeU4RRS3vayDECsVJceQLgAGSBXWMIKt6B1r/l3d8bq 6gw2heWtSLlCK6bJuKTkf48Lp/+zc2uH05dhIUTiAezEn1+IVAowZP0OZujUSg5Z O0Tb2+GBe4fLNVmlgLc3+lxxBsAkwy00eFCGU+u5PlkQAXrxqIJULzhxAzJ8EbjP wh7nttKcevl42LRn4Ihe =KM3W -----END PGP SIGNATURE----- --8CLqiYUo6qiluZPB-- From owner-freebsd-fs@FreeBSD.ORG Mon Mar 11 11:06:42 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 0859CA8E for ; Mon, 11 Mar 2013 11:06:42 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) by mx1.freebsd.org (Postfix) with ESMTP id EDAB77C4 for ; Mon, 11 Mar 2013 11:06:41 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.6/8.14.6) with ESMTP id r2BB6fsV088967 for ; Mon, 11 Mar 2013 11:06:41 GMT (envelope-from owner-bugmaster@FreeBSD.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.6/8.14.6/Submit) id r2BB6fRd088965 for freebsd-fs@FreeBSD.org; Mon, 11 Mar 2013 11:06:41 GMT (envelope-from owner-bugmaster@FreeBSD.org) Date: Mon, 11 Mar 2013 11:06:41 GMT Message-Id: <201303111106.r2BB6fRd088965@freefall.freebsd.org> X-Authentication-Warning: freefall.freebsd.org: gnats set 
sender to owner-bugmaster@FreeBSD.org using -f From: FreeBSD bugmaster To: freebsd-fs@FreeBSD.org Subject: Current problem reports assigned to freebsd-fs@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 11 Mar 2013 11:06:42 -0000

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.

S Tracker      Resp. Description
--------------------------------------------------------------------------------
o bin/176253   fs    zpool(8): zfs pool indentation is misleading/wrong
o kern/176141  fs    [zfs] sharesmb=on makes errors for sharenfs, and still
o kern/175950  fs    [zfs] Possible deadlock in zfs after long uptime
o kern/175897  fs    [zfs] operations on readonly zpool hang
o kern/175179  fs    [zfs] ZFS may attach wrong device on move
o kern/175071  fs    [ufs] [panic] softdep_deallocate_dependencies: unrecov
o kern/174372  fs    [zfs] Pagefault appears to be related to ZFS
o kern/174315  fs    [zfs] chflags uchg not supported
o kern/174310  fs    [zfs] root point mounting broken on CURRENT with multi
o kern/174279  fs    [ufs] UFS2-SU+J journal and filesystem corruption
o kern/174060  fs    [ext2fs] Ext2FS system crashes (buffer overflow?)
o kern/173830  fs    [zfs] Brain-dead simple change to ZFS error descriptio
o kern/173718  fs    [zfs] phantom directory in zraid2 pool
f kern/173657  fs    [nfs] strange UID map with nfsuserd
o kern/173363  fs    [zfs] [panic] Panic on 'zpool replace' on readonly poo
o kern/173136  fs    [unionfs] mounting above the NFS read-only share panic
o kern/172348  fs    [unionfs] umount -f of filesystem in use with readonly
o kern/172334  fs    [unionfs] unionfs permits recursive union mounts; caus
o kern/171626  fs    [tmpfs] tmpfs should be noisier when the requested siz
o kern/171415  fs    [zfs] zfs recv fails with "cannot receive incremental
o kern/170945  fs    [gpt] disk layout not portable between direct connect
o bin/170778   fs    [zfs] [panic] FreeBSD panics randomly
o kern/170680  fs    [nfs] Multiple NFS Client bug in the FreeBSD 7.4-RELEA
o kern/170497  fs    [xfs][panic] kernel will panic whenever I ls a mounted
o kern/169945  fs    [zfs] [panic] Kernel panic while importing zpool (afte
o kern/169480  fs    [zfs] ZFS stalls on heavy I/O
o kern/169398  fs    [zfs] Can't remove file with permanent error
o kern/169339  fs    panic while " : > /etc/123"
o kern/169319  fs    [zfs] zfs resilver can't complete
o kern/168947  fs    [nfs] [zfs] .zfs/snapshot directory is messed up when
o kern/168942  fs    [nfs] [hang] nfsd hangs after being restarted (not -HU
o kern/168158  fs    [zfs] incorrect parsing of sharenfs options in zfs (fs
o kern/167979  fs    [ufs] DIOCGDINFO ioctl does not work on 8.2 file syste
o kern/167977  fs    [smbfs] mount_smbfs results are differ when utf-8 or U
o kern/167688  fs    [fusefs] Incorrect signal handling with direct_io
o kern/167685  fs    [zfs] ZFS on USB drive prevents shutdown / reboot
o kern/167612  fs    [portalfs] The portal file system gets stuck inside po
o kern/167272  fs    [zfs] ZFS Disks reordering causes ZFS to pick the wron
o kern/167260  fs    [msdosfs] msdosfs disk was mounted the second time whe
o kern/167109  fs    [zfs] [panic] zfs diff kernel panic Fatal trap 9: gene
o kern/167105  fs    [nfs] mount_nfs can not handle source exports wiht mor
o kern/167067  fs    [zfs] [panic] ZFS panics the server
o kern/167065  fs    [zfs] boot fails when a spare is the boot disk
o kern/167048  fs    [nfs] [patch] RELEASE-9 crash when using ZFS+NULLFS+NF
o kern/166912  fs    [ufs] [panic] Panic after converting Softupdates to jo
o kern/166851  fs    [zfs] [hang] Copying directory from the mounted UFS di
o kern/166477  fs    [nfs] NFS data corruption.
o kern/165950  fs    [ffs] SU+J and fsck problem
o kern/165521  fs    [zfs] [hang] livelock on 1 Gig of RAM with zfs when 31
o kern/165392  fs    Multiple mkdir/rmdir fails with errno 31
o kern/165087  fs    [unionfs] lock violation in unionfs
o kern/164472  fs    [ufs] fsck -B panics on particular data inconsistency
o kern/164370  fs    [zfs] zfs destroy for snapshot fails on i386 and sparc
o kern/164261  fs    [nullfs] [patch] fix panic with NFS served from NULLFS
o kern/164256  fs    [zfs] device entry for volume is not created after zfs
o kern/164184  fs    [ufs] [panic] Kernel panic with ufs_makeinode
o kern/163801  fs    [md] [request] allow mfsBSD legacy installed in 'swap'
o kern/163770  fs    [zfs] [hang] LOR between zfs&syncer + vnlru leading to
o kern/163501  fs    [nfs] NFS exporting a dir and a subdir in that dir to
o kern/162944  fs    [coda] Coda file system module looks broken in 9.0
o kern/162860  fs    [zfs] Cannot share ZFS filesystem to hosts with a hyph
o kern/162751  fs    [zfs] [panic] kernel panics during file operations
o kern/162591  fs    [nullfs] cross-filesystem nullfs does not work as expe
o kern/162519  fs    [zfs] "zpool import" relies on buggy realpath() behavi
o kern/161968  fs    [zfs] [hang] renaming snapshot with -r including a zvo
o kern/161864  fs    [ufs] removing journaling from UFS partition fails on
o bin/161807   fs    [patch] add option for explicitly specifying metadata
o kern/161579  fs    [smbfs] FreeBSD sometimes panics when an smb share is
o kern/161533  fs    [zfs] [panic] zfs receive panic: system ioctl returnin
o kern/161438  fs    [zfs] [panic] recursed on non-recursive spa_namespace_
o kern/161424  fs    [nullfs] __getcwd() calls fail when used on nullfs mou
o kern/161280  fs    [zfs] Stack overflow in gptzfsboot
o kern/161205  fs    [nfs] [pfsync] [regression] [build] Bug report freebsd
o kern/161169  fs    [zfs] [panic] ZFS causes kernel panic in dbuf_dirty
o kern/161112  fs    [ufs] [lor] filesystem LOR in FreeBSD 9.0-BETA3
o kern/160893  fs    [zfs] [panic] 9.0-BETA2 kernel panic
o kern/160860  fs    [ufs] Random UFS root filesystem corruption with SU+J
o kern/160801  fs    [zfs] zfsboot on 8.2-RELEASE fails to boot from root-o
o kern/160790  fs    [fusefs] [panic] VPUTX: negative ref count with FUSE
o kern/160777  fs    [zfs] [hang] RAID-Z3 causes fatal hang upon scrub/impo
o kern/160706  fs    [zfs] zfs bootloader fails when a non-root vdev exists
o kern/160591  fs    [zfs] Fail to boot on zfs root with degraded raidz2 [r
o kern/160410  fs    [smbfs] [hang] smbfs hangs when transferring large fil
o kern/160283  fs    [zfs] [patch] 'zfs list' does abort in make_dataset_ha
o kern/159930  fs    [ufs] [panic] kernel core
o kern/159402  fs    [zfs][loader] symlinks cause I/O errors
o kern/159357  fs    [zfs] ZFS MAXNAMELEN macro has confusing name (off-by-
o kern/159356  fs    [zfs] [patch] ZFS NAME_ERR_DISKLIKE check is Solaris-s
o kern/159351  fs    [nfs] [patch] - divide by zero in mountnfs()
o kern/159251  fs    [zfs] [request]: add FLETCHER4 as DEDUP hash option
o kern/159077  fs    [zfs] Can't cd .. with latest zfs version
o kern/159048  fs    [smbfs] smb mount corrupts large files
o kern/159045  fs    [zfs] [hang] ZFS scrub freezes system
o kern/158839  fs    [zfs] ZFS Bootloader Fails if there is a Dead Disk
o kern/158802  fs    amd(8) ICMP storm and unkillable process.
o kern/158231  fs    [nullfs] panic on unmounting nullfs mounted over ufs o
f kern/157929  fs    [nfs] NFS slow read
o kern/157399  fs    [zfs] trouble with: mdconfig force delete && zfs strip
o kern/157179  fs    [zfs] zfs/dbuf.c: panic: solaris assert: arc_buf_remov
o kern/156797  fs    [zfs] [panic] Double panic with FreeBSD 9-CURRENT and
o kern/156781  fs    [zfs] zfs is losing the snapshot directory,
p kern/156545  fs    [ufs] mv could break UFS on SMP systems
o kern/156193  fs    [ufs] [hang] UFS snapshot hangs && deadlocks processes
o kern/156039  fs    [nullfs] [unionfs] nullfs + unionfs do not compose, re
o kern/155615  fs    [zfs] zfs v28 broken on sparc64 -current
o kern/155587  fs    [zfs] [panic] kernel panic with zfs
p kern/155411  fs    [regression] [8.2-release] [tmpfs]: mount: tmpfs : No
o kern/155199  fs    [ext2fs] ext3fs mounted as ext2fs gives I/O errors
o bin/155104   fs    [zfs][patch] use /dev prefix by default when importing
o kern/154930  fs    [zfs] cannot delete/unlink file from full volume -> EN
o kern/154828  fs    [msdosfs] Unable to create directories on external USB
o kern/154491  fs    [smbfs] smb_co_lock: recursive lock for object 1
p kern/154228  fs    [md] md getting stuck in wdrain state
o kern/153996  fs    [zfs] zfs root mount error while kernel is not located
o kern/153753  fs    [zfs] ZFS v15 - grammatical error when attempting to u
o kern/153716  fs    [zfs] zpool scrub time remaining is incorrect
o kern/153695  fs    [patch] [zfs] Booting from zpool created on 4k-sector
o kern/153680  fs    [xfs] 8.1 failing to mount XFS partitions
o kern/153418  fs    [zfs] [panic] Kernel Panic occurred writing to zfs vol
o kern/153351  fs    [zfs] locking directories/files in ZFS
o bin/153258   fs    [patch][zfs] creating ZVOLs requires `refreservation'
s kern/153173  fs    [zfs] booting from a gzip-compressed dataset doesn't w
o bin/153142   fs    [zfs] ls -l outputs `ls: ./.zfs: Operation not support
o kern/153126  fs    [zfs] vdev failure, zpool=peegel type=vdev.too_small
o kern/152022  fs    [nfs] nfs service hangs with linux client [regression]
o kern/151942  fs    [zfs] panic during ls(1) zfs snapshot directory
o kern/151905  fs    [zfs] page fault under load in /sbin/zfs
o bin/151713   fs    [patch] Bug in growfs(8) with respect to 32-bit overfl
o kern/151648  fs    [zfs] disk wait bug
o kern/151629  fs    [fs] [patch] Skip empty directory entries during name
o kern/151330  fs    [zfs] will unshare all zfs filesystem after execute a
o kern/151326  fs    [nfs] nfs exports fail if netgroups contain duplicate
o kern/151251  fs    [ufs] Can not create files on filesystem with heavy us
o kern/151226  fs    [zfs] can't delete zfs snapshot
o kern/150503  fs    [zfs] ZFS disks are UNAVAIL and corrupted after reboot
o kern/150501  fs    [zfs] ZFS vdev failure vdev.bad_label on amd64
o kern/150390  fs    [zfs] zfs deadlock when arcmsr reports drive faulted
o kern/150336  fs    [nfs] mountd/nfsd became confused; refused to reload n
o kern/149208  fs    mksnap_ffs(8) hang/deadlock
o kern/149173  fs    [patch] [zfs] make OpenSolaris installa
o kern/149015  fs    [zfs] [patch] misc fixes for ZFS code to build on Glib
o kern/149014  fs    [zfs] [patch] declarations in ZFS libraries/utilities
o kern/149013  fs    [zfs] [patch] make ZFS makefiles use the libraries fro
o kern/148504  fs    [zfs] ZFS' zpool does not allow replacing drives to be
o kern/148490  fs    [zfs]: zpool attach - resilver bidirectionally, and re
o kern/148368  fs    [zfs] ZFS hanging forever on 8.1-PRERELEASE
o kern/148138  fs    [zfs] zfs raidz pool commands freeze
o kern/147903  fs    [zfs] [panic] Kernel panics on faulty zfs device
o kern/147881  fs    [zfs] [patch] ZFS "sharenfs" doesn't allow different "
o kern/147420  fs    [ufs] [panic] ufs_dirbad, nullfs, jail panic (corrupt
o kern/146941  fs    [zfs] [panic] Kernel Double Fault - Happens constantly
o kern/146786  fs    [zfs] zpool import hangs with checksum errors
o kern/146708  fs    [ufs] [panic] Kernel panic in softdep_disk_write_compl
o kern/146528  fs    [zfs] Severe memory leak in ZFS on i386
o kern/146502  fs    [nfs] FreeBSD 8 NFS Client Connection to Server
s kern/145712  fs    [zfs] cannot offline two drives in a raidz2 configurat
o kern/145411  fs    [xfs] [panic] Kernel panics shortly after mounting an
f bin/145309   fs    bsdlabel: Editing disk label invalidates the whole dev
o kern/145272  fs    [zfs] [panic] Panic during boot when accessing zfs on
o kern/145246  fs    [ufs] dirhash in 7.3 gratuitously frees hashes when it
o kern/145238  fs    [zfs] [panic] kernel panic on zpool clear tank
o kern/145229  fs    [zfs] Vast differences in ZFS ARC behavior between 8.0
o kern/145189  fs    [nfs] nfsd performs abysmally under load
o kern/144929  fs    [ufs] [lor] vfs_bio.c + ufs_dirhash.c
p kern/144447  fs    [zfs] sharenfs fsunshare() & fsshare_main() non functi
o kern/144416  fs    [panic] Kernel panic on online filesystem optimization
s kern/144415  fs    [zfs] [panic] kernel panics on boot after zfs crash
o kern/144234  fs    [zfs] Cannot boot machine with recent gptzfsboot code
o kern/143825  fs    [nfs] [panic] Kernel panic on NFS client
o bin/143572   fs    [zfs] zpool(1): [patch] The verbose output from iostat
o kern/143212  fs    [nfs] NFSv4 client strange work ...
o kern/143184  fs    [zfs] [lor] zfs/bufwait LOR
o kern/142878  fs    [zfs] [vfs] lock order reversal
o kern/142597  fs    [ext2fs] ext2fs does not work on filesystems with real
o kern/142489  fs    [zfs] [lor] allproc/zfs LOR
o kern/142466  fs    Update 7.2 -> 8.0 on Raid 1 ends with screwed raid [re
o kern/142306  fs    [zfs] [panic] ZFS drive (from OSX Leopard) causes two
o kern/142068  fs    [ufs] BSD labels are got deleted spontaneously
o kern/141897  fs    [msdosfs] [panic] Kernel panic. msdofs: file name leng
o kern/141463  fs    [nfs] [panic] Frequent kernel panics after upgrade fro
o kern/141305  fs    [zfs] FreeBSD ZFS+sendfile severe performance issues (
o kern/141091  fs    [patch] [nullfs] fix panics with DIAGNOSTIC enabled
o kern/141086  fs    [nfs] [panic] panic("nfs: bioread, not dir") on FreeBS
o kern/141010  fs    [zfs] "zfs scrub" fails when backed by files in UFS2
o kern/140888  fs    [zfs] boot fail from zfs root while the pool resilveri
o kern/140661  fs    [zfs] [patch] /boot/loader fails to work on a GPT/ZFS-
o kern/140640  fs    [zfs] snapshot crash
o kern/140068  fs    [smbfs] [patch] smbfs does not allow semicolon in file
o kern/139725  fs    [zfs] zdb(1) dumps core on i386 when examining zpool c
o kern/139715  fs    [zfs] vfs.numvnodes leak on busy zfs
p bin/139651   fs    [nfs] mount(8): read-only remount of NFS volume does n
o kern/139407  fs    [smbfs] [panic] smb mount causes system crash if remot
o kern/138662  fs    [panic] ffs_blkfree: freeing free block
o kern/138421  fs    [ufs] [patch] remove UFS label limitations
o kern/138202  fs    mount_msdosfs(1) see only 2Gb
o kern/136968  fs    [ufs] [lor] ufs/bufwait/ufs (open)
o kern/136945  fs    [ufs] [lor] filedesc structure/ufs (poll)
o kern/136944  fs    [ffs] [lor] bufwait/snaplk (fsync)
o kern/136873  fs    [ntfs] Missing directories/files on NTFS volume
o kern/136865  fs    [nfs] [patch] NFS exports atomic and on-the-fly atomic
p kern/136470  fs    [nfs] Cannot mount / in read-only, over NFS
o kern/135546  fs    [zfs] zfs.ko module doesn't ignore zpool.cache filenam
o kern/135469  fs    [ufs] [panic] kernel crash on md operation in ufs_dirb
o kern/135050  fs    [zfs] ZFS clears/hides disk errors on reboot
o kern/134491  fs    [zfs] Hot spares are rather cold...
o kern/133676  fs         [smbfs] [panic] umount -f'ing a vnode-based memory dis
p kern/133174  fs         [msdosfs] [patch] msdosfs must support multibyte inter
o kern/132960  fs         [ufs] [panic] panic:ffs_blkfree: freeing free frag
o kern/132397  fs         reboot causes filesystem corruption (failure to sync b
o kern/132331  fs         [ufs] [lor] LOR ufs and syncer
o kern/132237  fs         [msdosfs] msdosfs has problems to read MSDOS Floppy
o kern/132145  fs         [panic] File System Hard Crashes
o kern/131441  fs         [unionfs] [nullfs] unionfs and/or nullfs not combineab
o kern/131360  fs         [nfs] poor scaling behavior of the NFS server under lo
o kern/131342  fs         [nfs] mounting/unmounting of disks causes NFS to fail
o bin/131341   fs         makefs: error "Bad file descriptor" on the mount poin
o kern/130920  fs         [msdosfs] cp(1) takes 100% CPU time while copying file
o kern/130210  fs         [nullfs] Error by check nullfs
o kern/129760  fs         [nfs] after 'umount -f' of a stale NFS share FreeBSD l
o kern/129488  fs         [smbfs] Kernel "bug" when using smbfs in smbfs_smb.c:
o kern/129231  fs         [ufs] [patch] New UFS mount (norandom) option - mostly
o kern/129152  fs         [panic] non-userfriendly panic when trying to mount(8)
o kern/127787  fs         [lor] [ufs] Three LORs: vfslock/devfs/vfslock, ufs/vfs
o bin/127270   fs         fsck_msdosfs(8) may crash if BytesPerSec is zero
o kern/127029  fs         [panic] mount(8): trying to mount a write protected zi
o kern/126287  fs         [ufs] [panic] Kernel panics while mounting an UFS file
o kern/125895  fs         [ffs] [panic] kernel: panic: ffs_blkfree: freeing free
s kern/125738  fs         [zfs] [request] SHA256 acceleration in ZFS
o kern/123939  fs         [msdosfs] corrupts new files
o kern/122380  fs         [ffs] ffs_valloc:dup alloc (Soekris 4801/7.0/USB Flash
o bin/122172   fs         [fs]: amd(8) automount daemon dies on 6.3-STABLE i386,
o bin/121898   fs         [nullfs] pwd(1)/getcwd(2) fails with Permission denied
o bin/121072   fs         [smbfs] mount_smbfs(8) cannot normally convert the cha
o kern/120483  fs         [ntfs] [patch] NTFS filesystem locking changes
o kern/120482  fs         [ntfs] [patch] Sync style changes between NetBSD and F
o kern/118912  fs         [2tb] disk sizing/geometry problem with large array
o kern/118713  fs         [minidump] [patch] Display media size required for a k
o kern/118318  fs         [nfs] NFS server hangs under special circumstances
o bin/118249   fs         [ufs] mv(1): moving a directory changes its mtime
o kern/118126  fs         [nfs] [patch] Poor NFS server write performance
o kern/118107  fs         [ntfs] [panic] Kernel panic when accessing a file at N
o kern/117954  fs         [ufs] dirhash on very large directories blocks the mac
o bin/117315   fs         [smbfs] mount_smbfs(8) and related options can't mount
o kern/117158  fs         [zfs] zpool scrub causes panic if geli vdevs detach on
o bin/116980   fs         [msdosfs] [patch] mount_msdosfs(8) resets some flags f
o conf/116931  fs         lack of fsck_cd9660 prevents mounting iso images with
o kern/116583  fs         [ffs] [hang] System freezes for short time when using
o bin/115361   fs         [zfs] mount(8) gets into a state where it won't set/un
o kern/114955  fs         [cd9660] [patch] [request] support for mask,dirmask,ui
o kern/114847  fs         [ntfs] [patch] [request] dirmask support for NTFS ala
o kern/114676  fs         [ufs] snapshot creation panics: snapacct_ufs2: bad blo
o bin/114468   fs         [patch] [request] add -d option to umount(8) to detach
o kern/113852  fs         [smbfs] smbfs does not properly implement DFS referral
o bin/113838   fs         [patch] [request] mount(8): add support for relative p
o bin/113049   fs         [patch] [request] make quot(8) use getopt(3) and show
o kern/112658  fs         [smbfs] [patch] smbfs and caching problems (resolves b
o kern/111843  fs         [msdosfs] Long Names of files are incorrectly created
o kern/111782  fs         [ufs] dump(8) fails horribly for large filesystems
s bin/111146   fs         [2tb] fsck(8) fails on 6T filesystem
o bin/107829   fs         [2TB] fdisk(8): invalid boundary checking in fdisk / w
o kern/106107  fs         [ufs] left-over fsck_snapshot after unfinished backgro
o kern/104406  fs         [ufs] Processes get stuck in "ufs" state under persist
o kern/104133  fs         [ext2fs] EXT2FS module corrupts EXT2/3 filesystems
o kern/103035  fs         [ntfs] Directories in NTFS mounted disc images appear
o kern/101324  fs         [smbfs] smbfs sometimes not case sensitive when it's s
o kern/99290   fs         [ntfs] mount_ntfs ignorant of cluster sizes
s bin/97498    fs         [request] newfs(8) has no option to clear the first 12
o kern/97377   fs         [ntfs] [patch] syntax cleanup for ntfs_ihash.c
o kern/95222   fs         [cd9660] File sections on ISO9660 level 3 CDs ignored
o kern/94849   fs         [ufs] rename on UFS filesystem is not atomic
o bin/94810    fs         fsck(8) incorrectly reports 'file system marked clean'
o kern/94769   fs         [ufs] Multiple file deletions on multi-snapshotted fil
o kern/94733   fs         [smbfs] smbfs may cause double unlock
o kern/93942   fs         [vfs] [patch] panic: ufs_dirbad: bad dir (patch from D
o kern/92272   fs         [ffs] [hang] Filling a filesystem while creating a sna
o kern/91134   fs         [smbfs] [patch] Preserve access and modification time
a kern/90815   fs         [smbfs] [patch] SMBFS with character conversions somet
o kern/88657   fs         [smbfs] windows client hang when browsing a samba shar
o kern/88555   fs         [panic] ffs_blkfree: freeing free frag on AMD 64
o bin/87966    fs         [patch] newfs(8): introduce -A flag for newfs to enabl
o kern/87859   fs         [smbfs] System reboot while umount smbfs.
o kern/86587   fs         [msdosfs] rm -r /PATH fails with lots of small files
o bin/85494    fs         fsck_ffs: unchecked use of cg_inosused macro etc.
o kern/80088   fs         [smbfs] Incorrect file time setting on NTFS mounted vi
o bin/74779    fs         Background-fsck checks one filesystem twice and omits
o kern/73484   fs         [ntfs] Kernel panic when doing `ls` from the client si
o bin/73019    fs         [ufs] fsck_ufs(8) cannot alloc 607016868 bytes for ino
o kern/71774   fs         [ntfs] NTFS cannot "see" files on a WinXP filesystem
o bin/70600    fs         fsck(8) throws files away when it can't grow lost+foun
o kern/68978   fs         [panic] [ufs] crashes with failing hard disk, loose po
o kern/65920   fs         [nwfs] Mounted Netware filesystem behaves strange
o kern/65901   fs         [smbfs] [patch] smbfs fails fsx write/truncate-down/tr
o kern/61503   fs         [smbfs] mount_smbfs does not work as non-root
o kern/55617   fs         [smbfs] Accessing an nsmb-mounted drive via a smb expo
o kern/51685   fs         [hang] Unbounded inode allocation causes kernel to loc
o kern/36566   fs         [smbfs] System reboot with dead smb mount and umount
o bin/27687    fs         fsck(8) wrapper is not properly passing options to fsc
o kern/18874   fs         [2TB] 32bit NFS servers export wrong negative values t

298 problems total.
From owner-freebsd-fs@FreeBSD.ORG Mon Mar 11 17:19:22 2013
Message-ID: <513E1208.5020804@caltel.com>
Date: Mon, 11 Mar 2013 10:19:04 -0700
From: Cody Ritts
Organization: CalTel
To: freebsd-fs@freebsd.org
Subject: Re: Aligning MBR for ZFS boot help
References: <513C1629.50501@caltel.com> <513CD9AB.5080903@caltel.com> <513CE369.4030303@caltel.com> <1362951595.99445.2.camel@btw.pki2.com>

Update -- fdisk WILL allow you to align without regard to drive geometry.
It can only be done in interactive mode:
http://lists.freebsd.org/pipermail/freebsd-geom/2011-May/004780.html

> fdisk -i /dev/ada0
> Do you want to change our idea of what BIOS thinks ? [n]
> The data for partition 1 is:
> Do you want to change it? [n] y
> Supply a decimal value for "sysid (165=FreeBSD)" [165]
> Supply a decimal value for "start" [63] 2048
> Supply a decimal value for "size" [125045361] 125043376
> Correct this automatically? [n]
> Explicitly specify beg/end address ? [n]
> Are we happy with this entry? [n] y
> Do you want to change the active partition? [n]
> Should we write new partition table? [n] y
>
> gpart show ada0
> =>       63  125045361  ada0  MBR  (59G)
>          63       1985        - free -  (992k)
>        2048  125043376     1  freebsd  [active]  (59G)

Thanks,
Cody

On 3/10/13 9:19 PM, J David wrote:
> On Sun, Mar 10, 2013 at 8:39 PM, Warren Block wrote:
>
>> If FreeBSD had a way to turn off the strict enforcement of CHS values for
>> MBR, it would make that unnecessary.
>>
>
> The solution to that for a drive of this size *might* be to use fdisk
> instead of gpart to do the partitioning, and just flat out lie to it about
> the geometry when it asks you if you want to use something other than what
> it read. (Which was also a lie anyway.)
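The "size" value typed into the fdisk session above follows from keeping the end of the partition where it was while moving the start. A minimal sketch of that arithmetic, assuming 512-byte logical sectors; the helper name shifted_size is made up for illustration and is not part of fdisk or gpart:

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def shifted_size(old_start, old_size, new_start):
    """Size in sectors that keeps the partition's end where it was."""
    if new_start < old_start:
        raise ValueError("new start lies before the original start")
    return old_size - (new_start - old_start)


# Defaults offered by fdisk in the session above.
old_start, old_size = 63, 125045361
# Start chosen in the session: a 1MiB boundary with 512-byte sectors.
new_start = 2048

new_size = shifted_size(old_start, old_size, new_start)
gap_sectors = new_start - old_start

print(new_size)      # 125043376, the "size" entered above
print(gap_sectors)   # 1985, the "- free -" span shown by gpart
print(new_start * SECTOR_BYTES == 1024 * 1024)  # True
```

The 1985-sector gap is the space given up in front of the partition, which matches the "- free -" row in the gpart output.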

From owner-freebsd-fs@FreeBSD.ORG Mon Mar 11 18:09:23 2013
Message-ID: <513E1DD2.7030609@caltel.com>
Date: Mon, 11 Mar 2013 11:09:22 -0700
From: Cody Ritts
Organization: CalTel
To: freebsd-fs@freebsd.org
Subject: Re: Aligning MBR for ZFS boot help
References: <513C1629.50501@caltel.com> <513D0E90.5090105@platinum.linux.pl>
In-Reply-To: <513D0E90.5090105@platinum.linux.pl>

On 3/10/13 3:52 PM, Adam Nowacki wrote:
> I don't think zfsboot is aware of BSD disklabel (offsets other than 0
> won't boot). Is there any reason you are using BSD disklabel and not two
> partition MBR?

The reason is because every example I saw used labels.
I just tried it, and it does not boot.
I get:

FreeBSD/x86 ZFS enabled bootstrap loader. Revision 1.1
ZFS: can't find pool by guid.

> I also don't think there is any merit in aligning to 1MiB. Most ZFS IOs
> will be aligned to sector size (ashift). Unless ZFS pool is created with
> higher ashift then the 63 sector offset is as good as any.

Aligning to the Erase block:

http://blog.nuclex-games.com/2009/12/aligning-an-ssd-on-linux/
Also I will be forcing ashift to 12 using the gnop trick.

If you still feel that is not necessary, I would be interested in
knowing why?

Thanks,
Cody

From owner-freebsd-fs@FreeBSD.ORG Mon Mar 11 18:20:31 2013
Date: Mon, 11 Mar 2013 19:20:29 +0100
Subject: zfs hang with umount
From: Uroš Gruber
To: freebsd-fs@freebsd.org

Hi,

I don't know what causes this, but while stopping one of the jails I also
ran zfs inherit mountpoint on this jail's fs. The jail was in the stopping
state at that moment. The process then hung in D state. Then I was doing
some stuff on the server, and during "zfs unmount -f" of that fs the
server crashed. Now every time I want to unmount that fs, the process hangs.
I've managed to send & receive this fs to another fs and mount it
successfully.

Before I reboot and try to destroy this fs, here is the output of procstat
-k PID (zfs umount zroot/myfs):

  PID    TID COMM             TDNAME           KSTACK
 3937 100559 zfs              -                mi_switch
sleepq_timedwait _sleep zfs_zget zfs_get_data zil_commit
zfs_freebsd_write VOP_WRITE_APV vnode_pager_generic_putpages
vnode_pager_putpages vm_pageout_flush vm_object_page_collect_flush
vm_object_page_clean vm_object_terminate vnode_destroy_vobject
zfs_freebsd_reclaim vgonel vflush

Is there anything I can check, or is this a known bug?

Server is running on 9.1-RELEASE

regards

Uros

From owner-freebsd-fs@FreeBSD.ORG Mon Mar 11 19:52:39 2013
Message-ID: <513E35EC.4080309@platinum.linux.pl>
Date: Mon, 11 Mar 2013 20:52:12 +0100
From: Adam Nowacki
To: freebsd-fs@freebsd.org
Subject: Re: Aligning MBR for ZFS boot help
References: <513C1629.50501@caltel.com> <513D0E90.5090105@platinum.linux.pl> <513E1DD2.7030609@caltel.com>
In-Reply-To: <513E1DD2.7030609@caltel.com>

On 2013-03-11 19:09, Cody Ritts wrote:
> On 3/10/13 3:52 PM, Adam Nowacki wrote:
>> I don't think zfsboot is aware of BSD disklabel (offsets other than 0
>> won't boot). Is there any reason you are using BSD disklabel and not two
>> partition MBR?
>
> The reason is because every example I saw used labels.
> I just tried it, and it does not boot.
> I get:
>
> FreeBSD/x86 ZFS enabled bootstrap loader. Revision 1.1
> ZFS: can't find pool by guid.

Then I guess zfsloader requires a BSD disklabel for MBR (but zfsboot still
has to be at offset 0 and 1024 sectors relative to the MBR partition, as it
doesn't read the BSD disklabel).

>> I also don't think there is any merit in aligning to 1MiB. Most ZFS IOs
>> will be aligned to sector size (ashift). Unless ZFS pool is created with
>> higher ashift then the 63 sector offset is as good as any.
>
> Aligning to the Erase block:
>
> http://blog.nuclex-games.com/2009/12/aligning-an-ssd-on-linux/
> Also I will be forcing ashift to 12 using the gnop trick.
>
> If you still feel that is not necessary, I would be interested in
> knowing why?

The mapping between sectors and physical flash pages/blocks is not fixed
and will change on each write or internal garbage collection.
http://www.devwhy.com/blog/2009/8/4/from-write-down-to-the-flash-chips.html
seems to explain this nicely. Aligning to more than the page size offers no
benefit, since a page is the biggest continuous chunk of data that remains
continuous all the way to physical flash.

If your SSD has a page size of 4KiB then align to that. This is sector 504
on FreeBSD (due to the multiple of 63 issue). The ZFS pool will have to be
created with ashift=12.
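
The sector 504 figure above can be checked with a few lines of arithmetic. This is only a sketch, assuming 512-byte logical sectors and the 63-sectors-per-track CHS geometry enforced for MBR; the function names are hypothetical and do not come from any FreeBSD tool:

```python
from math import gcd

SECTOR_BYTES = 512    # assumption: 512-byte logical sectors
TRACK_SECTORS = 63    # assumption: CHS track size enforced for MBR


def first_aligned_sector(page_bytes, track_sectors=TRACK_SECTORS,
                         sector_bytes=SECTOR_BYTES):
    """Smallest non-zero sector that is a multiple of the track size
    and also falls on a page boundary (least common multiple)."""
    sectors_per_page = page_bytes // sector_bytes
    return track_sectors * sectors_per_page // gcd(track_sectors,
                                                   sectors_per_page)


def is_aligned(start_sector, boundary_bytes, sector_bytes=SECTOR_BYTES):
    """True if the byte offset of start_sector is a multiple of boundary_bytes."""
    return (start_sector * sector_bytes) % boundary_bytes == 0


print(first_aligned_sector(4096))       # 504
print(is_aligned(63, 4096))             # False
print(is_aligned(504, 4096))            # True
print(is_aligned(2048, 1024 * 1024))    # True
```

With these assumptions a 4KiB page is 8 sectors, and 504 is the least common multiple of 63 and 8, so it is the first track boundary that is also page-aligned. Sector 2048, used in the fdisk session earlier in the thread, is aligned to 1MiB but is not a multiple of 63.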

From owner-freebsd-fs@FreeBSD.ORG Mon Mar 11 19:55:49 2013
Date: Mon, 11 Mar 2013 23:55:48 +0400 (MSK)
From: Dmitry Morozovsky
To: araujo@FreeBSD.org
Subject: Re: carp on stable/9: is there a way to keep jumbo?
Cc: freebsd-fs@FreeBSD.org

On Tue, 5 Mar 2013, Marcelo Araujo wrote:

> > yes, I know glebius@ overhauled carp in -current, but I'm a bit nervous to
> > deploy bleeding edge system on a NAS/SAN ;)
> >
> > So, my question is about current state of carp in stable/9: building HA
> > pair I found that carp interfaces lose jumbo capabilities:
> >
> >
> Hello Dmitry,
>
> I made a patch for 9.1-RELEASE, it is totally based on glebius@ work, or
> partially :). I'm using it nowadays and it just works pretty fine for me.
>
> I didn't test with JUMBO frame, but you can give a try and let us know if
> it works or not.
>
> PATCH: http://people.freebsd.org/~araujo/carpdev/

I've managed to apply this finally :)

It seems your patch is sometimes spammed with $FreeBSD$ changes, which leads
to 4 .rej's for me (none except ./sys/netinet/ip_carp.c.rej is significant,
but they may cause problems in future merging).

Only buildworld tests have finished for me so far; more to test later and/or
tomorrow.

Thank you!

-- 
Sincerely,
D.Marck                                     [DM5020, MCK-RIPE, DM3-RIPN]
[ FreeBSD committer:                                 marck@FreeBSD.org ]
------------------------------------------------------------------------
*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- marck@rinet.ru ***
------------------------------------------------------------------------

From owner-freebsd-fs@FreeBSD.ORG Mon Mar 11 21:22:32 2013
Message-ID: <513E4B0B.3090807@FreeBSD.org>
Date: Mon, 11 Mar 2013 23:22:19 +0200
From: Andriy Gapon
To: Uroš Gruber
Subject: Re: zfs hang with umount
Cc: freebsd-fs@FreeBSD.org

on 11/03/2013 20:20 Uroš Gruber said the following:
> Hi,
>
> I don't know what causes this, but while stopping one of the jails I also
> ran zfs inherit mountpoint on this jail's fs. The jail was in the stopping
> state at that moment. The process then hung in D state. Then I was doing
> some stuff on the server, and during "zfs unmount -f" of that fs the
> server crashed. Now every time I want to unmount that fs, the process hangs.
> I've managed to send & receive this fs to another fs and mount it
> successfully.
>
> Before I reboot and try to destroy this fs, here is the output of procstat
> -k PID (zfs umount zroot/myfs):
>
>   PID    TID COMM             TDNAME           KSTACK
>  3937 100559 zfs              -                mi_switch
> sleepq_timedwait _sleep zfs_zget zfs_get_data zil_commit
> zfs_freebsd_write VOP_WRITE_APV vnode_pager_generic_putpages
> vnode_pager_putpages vm_pageout_flush vm_object_page_collect_flush
> vm_object_page_clean vm_object_terminate vnode_destroy_vobject
> zfs_freebsd_reclaim vgonel vflush
>
> Is there anything I can check, or is this a known bug?
>
> Server is running on 9.1-RELEASE

This should be fixed in stable/9.

-- 
Andriy Gapon

From owner-freebsd-fs@FreeBSD.ORG Mon Mar 11 21:31:50 2013
Message-ID: <513E4D45.1020804@caltel.com>
Date: Mon, 11 Mar 2013 14:31:49 -0700
From: Cody Ritts
Organization: CalTel
To: freebsd-fs@freebsd.org
Subject: Re: Aligning MBR for ZFS boot help
References: <513C1629.50501@caltel.com> <513D0E90.5090105@platinum.linux.pl> <513E1DD2.7030609@caltel.com> <513E35EC.4080309@platinum.linux.pl>
In-Reply-To: <513E35EC.4080309@platinum.linux.pl>

On 3/11/13 12:52 PM, Adam Nowacki wrote:
>>> I also don't think there is any merit in aligning to 1MiB. Most ZFS IOs
>>> will be aligned to sector size (ashift). Unless ZFS pool is created with
>>> higher ashift then the 63 sector offset is as good as any.
>>
>> Aligning to the Erase block:
>>
>> http://blog.nuclex-games.com/2009/12/aligning-an-ssd-on-linux/
>> Also I will be forcing ashift to 12 using the gnop trick.
>>
>> If you still feel that is not necessary, I would be interested in
>> knowing why?
>
> The mapping between sectors and physical flash pages/blocks is not fixed
> and will change on each write or internal garbage collection.
> http://www.devwhy.com/blog/2009/8/4/from-write-down-to-the-flash-chips.html
> seems to explain this nicely. Aligning to more than page size offers no
> benefit since this is the biggest continuous chunk of data that remains
> continuous all the way to physical flash.
>
> If your SSD has page size of 4KiB then align to that. This is sector 504
> on FreeBSD (due to the multiple of 63 issue). ZFS pool will have to be
> created with ashift=12.

Hmmmm... I see the point you are making, and there is so much that I don't
know about ZFS, SSDs and ATA.

There is a commenter on this page who certainly seems to agree with you:
https://github.com/zfsonlinux/zfs/pull/924

But the vast majority of pages claim that aligning the partition
boundaries to multiples of the erase block is really important. (Not that
more pages makes it correct.)

But if you are right, and aligning to the erase block is pointless because
the SSD doesn't care, then it should not hurt if I do add an offset, other
than that I will lose a few MB of space.

It is certainly a good point you make, but I just don't have the time to
learn everything I need to know to make an educated decision for myself.
So if I can satisfy the majority with no detriment, I will just do that so
I can get this thing into production.

Thanks Cody From owner-freebsd-fs@FreeBSD.ORG Mon Mar 11 21:43:07 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 020F8183; Mon, 11 Mar 2013 21:43:07 +0000 (UTC) (envelope-from uros.gruber@gmail.com) Received: from mail-ie0-x233.google.com (mail-ie0-x233.google.com [IPv6:2607:f8b0:4001:c03::233]) by mx1.freebsd.org (Postfix) with ESMTP id C2B413CA; Mon, 11 Mar 2013 21:43:06 +0000 (UTC) Received: by mail-ie0-f179.google.com with SMTP id k11so5403811iea.38 for ; Mon, 11 Mar 2013 14:43:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type:content-transfer-encoding; bh=fH6dR7rX5IiZs/qX4LkoHNlucOp2Ohd/7CvgwtYXdhs=; b=EIowviBv7PunTSDYT8MCUiDC2Pdj01YncbPf4JpOb/QcRMshcPvv2MWOdsyKZTSzKJ S0YagyS46IYjxzIRtWhH9+xzvxaU0O5xNuEewv79PD09nimT6S0k5X5fl31ktipQZqTa fW5QkusJ9qXMaaX9OMEgP9MhnYxMdMKfp0aLtiMNMXxEQm299bUctFY4Ps1tH0MV+ZCg VeabmDw052yV0Uwj2zVAPT9SjEWO8pO+zQFpjy26I9INw4jhvbu0bSo64qQ2n05pqNk4 IVjCDTMgH7ICDaUfP+oSl/B6keGNoZwk1QFRjf/F8nwcP0RJjotwVn4/+IREfk6qm72a AyWA== MIME-Version: 1.0 X-Received: by 10.50.151.179 with SMTP id ur19mr8954487igb.79.1363038186181; Mon, 11 Mar 2013 14:43:06 -0700 (PDT) Received: by 10.64.26.166 with HTTP; Mon, 11 Mar 2013 14:43:06 -0700 (PDT) In-Reply-To: <513E4B0B.3090807@FreeBSD.org> References: <513E4B0B.3090807@FreeBSD.org> Date: Mon, 11 Mar 2013 22:43:06 +0100 Message-ID: Subject: Re: zfs hang with umount From: =?UTF-8?B?VXJvxaEgR3J1YmVy?= To: Andriy Gapon Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 11 Mar 2013 
21:43:07 -0000

Hi Andriy,

Can you tell me more about this, or what the cause was, so I can avoid it until 9.2 is released? I don't want to jump on 9-STABLE and maybe have more problems with some other stuff.

regards

Uros

On 11 March 2013 22:22, Andriy Gapon wrote:
> on 11/03/2013 20:20 Uroš Gruber said the following:
>> Hi,
>>
>> I don't know what causes this, but while stopping one of jails I also
>> run zfs inherit mountpoint on this jails fs. This jail was in stopping
>> state at this moment. Process than hanged in D state. Then I was doing
>> some stuff on the server, and while "zfs unmount -f" of that fs,
>> server chrased. Now everytime I wan't to unmount that fs process hang.
>> I've managed to send & receive this fs to other fs and mounted
>> sucessfuly.
>>
>> Before I reboot and try to destroy this fs, here is output of procstat
>> -k PID (zfs umount zroot/myfs)
>>
>> PID TID COMM TDNAME KSTACK
>> 3937 100559 zfs - mi_switch
>> sleepq_timedwait _sleep zfs_zget zfs_get_data zil_commit
>> zfs_freebsd_write VOP_WRITE_APV vnode_pager_generic_putpages
>> vnode_pager_putpages vm_pageout_flush vm_object_page_collect_flush
>> vm_object_page_clean vm_object_terminate vnode_destroy_vobject
>> zfs_freebsd_reclaim vgonel vflush
>>
>> Is there anything I can check or is this know bug?
>>
>> Server is running on 9.1-RELEASE
>
> This should be fixed in stable/9.
> > -- > Andriy Gapon From owner-freebsd-fs@FreeBSD.ORG Tue Mar 12 01:45:12 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 7AA797CC for ; Tue, 12 Mar 2013 01:45:12 +0000 (UTC) (envelope-from bryan-lists@shatow.net) Received: from secure.xzibition.com (secure.xzibition.com [173.160.118.92]) by mx1.freebsd.org (Postfix) with ESMTP id 1ACA3EF3 for ; Tue, 12 Mar 2013 01:45:11 +0000 (UTC) DomainKey-Signature: a=rsa-sha1; c=nofws; d=shatow.net; h=message-id :date:from:mime-version:to:cc:subject:references:in-reply-to :content-type:content-transfer-encoding; q=dns; s=sweb; b=dd3SC0 0ldUj9fu7yVJvyBYNsmXExBU/QVFm52Sogjt9b8b3Mz3Kok/0qQXuWxNJKjieBiD tbxkJpfRtgc/jmB4oaz4TD6MDFn5gxQkjxjBBK4sJ8MP10hu8cfP/8IJSwl/Ym7u GklMSx+CxMaLw7Deu8bP0hQY5pxMk9ver98nY= DKIM-Signature: v=1; a=rsa-sha256; c=simple; d=shatow.net; h=message-id :date:from:mime-version:to:cc:subject:references:in-reply-to :content-type:content-transfer-encoding; s=sweb; bh=zTkqveSmzYP+ g9DgEtIio0jBZql2HEJjltPTJX9nQV8=; b=pv3QGvpnTIeNw+lXJXsk34wKXJnz bvvohOCEbdNxwo75c+ykqmiP0Ge65rPnoqrVMcUVsiPNdjff4llIouOWJ2QgS2NE 3lYOuqJCNh1i36cszir1Wr9CGegJdowz6+G/A7CdHlfV1wHw/x4mb2QZOaYFKvU0 BIqxuy1kxuccppg= Received: (qmail 38312 invoked from network); 11 Mar 2013 20:38:29 -0500 Received: from unknown (HELO ?10.10.0.24?) 
(bryan@shatow.net@10.10.0.24) by sweb.xzibition.com with ESMTPA; 11 Mar 2013 20:38:29 -0500
Message-ID: <513E8711.2080909@shatow.net>
Date: Mon, 11 Mar 2013 20:38:25 -0500
From: Bryan Drewery
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130215 Thunderbird/17.0.3
MIME-Version: 1.0
To: =?UTF-8?B?VXJvxaEgR3J1YmVy?=
Subject: Re: zfs hang with umount
References: <513E4B0B.3090807@FreeBSD.org>
In-Reply-To:
X-Enigmail-Version: 1.5.1
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 12 Mar 2013 01:45:12 -0000

On 3/11/2013 4:43 PM, Uroš Gruber wrote:
> Hi Andriy,
>
> can you tell me more about this or what was the cause for this so I
> can avoid it until 9.2 is released. I don't want to jump on 9-STABLE
> and maybe have more problems with some other stuff.
>

This can very easily happen with `umount -f`. My understanding is it just takes a file descriptor being open to an unlinked file. Best to avoid -f if possible.

If you run into one of these deadlocks you'll need to restart, possibly with reboot(8).

> regards
>
> Uros
>
> On 11 March 2013 22:22, Andriy Gapon wrote:
>> on 11/03/2013 20:20 Uroš Gruber said the following:
>>> Hi,
>>>
>>> I don't know what causes this, but while stopping one of jails I also
>>> run zfs inherit mountpoint on this jails fs. This jail was in stopping
>>> state at this moment. Process than hanged in D state. Then I was doing
>>> some stuff on the server, and while "zfs unmount -f" of that fs,
>>> server chrased. Now everytime I wan't to unmount that fs process hang.
>>> I've managed to send & receive this fs to other fs and mounted
>>> sucessfuly.
>>> >>> Before I reboot and try to destroy this fs, here is output of procstat >>> -k PID (zfs umount zroot/myfs) >>> >>> PID TID COMM TDNAME KSTACK >>> 3937 100559 zfs - mi_switch >>> sleepq_timedwait _sleep zfs_zget zfs_get_data zil_commit >>> zfs_freebsd_write VOP_WRITE_APV vnode_pager_generic_putpages >>> vnode_pager_putpages vm_pageout_flush vm_object_page_collect_flush >>> vm_object_page_clean vm_object_terminate vnode_destroy_vobject >>> zfs_freebsd_reclaim vgonel vflush >>> >>> Is there anything I can check or is this know bug? >>> >>> Server is running on 9.1-RELEASE >> >> This should be fixed in stable/9. >> >> -- >> Andriy Gapon > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" > -- Regards, Bryan Drewery bdrewery@freenode/EFNet From owner-freebsd-fs@FreeBSD.ORG Tue Mar 12 02:10:39 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 21F83B76 for ; Tue, 12 Mar 2013 02:10:39 +0000 (UTC) (envelope-from lstewart@freebsd.org) Received: from lauren.room52.net (lauren.room52.net [210.50.193.198]) by mx1.freebsd.org (Postfix) with ESMTP id 860D4FA0 for ; Tue, 12 Mar 2013 02:10:38 +0000 (UTC) Received: from lstewart.caia.swin.edu.au (lstewart.caia.swin.edu.au [136.186.229.95]) by lauren.room52.net (Postfix) with ESMTPSA id 7A1627E81E for ; Tue, 12 Mar 2013 13:10:30 +1100 (EST) Message-ID: <513E8E95.6010802@freebsd.org> Date: Tue, 12 Mar 2013 13:10:29 +1100 From: Lawrence Stewart User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130213 Thunderbird/17.0.2 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: ZFS triggered 9-STABLE r246646 panic "vdrop: holdcnt 0" Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Spam-Status: No, 
score=0.0 required=5.0 tests=UNPARSEABLE_RELAY autolearn=unavailable version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on lauren.room52.net X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 12 Mar 2013 02:10:39 -0000 Hi all, I got this panic yesterday. I haven't seen it before (or since), but I have the crashdump and kernel here if there's additional information I can provide that would be useful in finding the cause. The machine runs ZFS exclusively and was under quite heavy CPU and IO load at the time of the crash as I was compiling in a VirtualBox VM and on the host itself, as well as running a full KDE desktop environment. I'm fairly certain the machine was not swapping at the time of the crash. lstewart@lstewart> uname -a FreeBSD lstewart 9.1-STABLE FreeBSD 9.1-STABLE #8 r246646M: Mon Feb 11 14:57:13 EST 2013 root@lstewart:/usr/obj/usr/src/sys/LSTEWART-DESKTOP amd64 lstewart@lstewart> sudo kgdb /boot/kernel/kernel /var/crash/vmcore.0 [...] (kgdb) bt #0 doadump (textdump=) at pcpu.h:229 #1 0xffffffff808e5824 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:448 #2 0xffffffff808e5d27 in panic (fmt=0x1
) at /usr/src/sys/kern/kern_shutdown.c:636 #3 0xffffffff8097a71e in vdropl (vp=) at /usr/src/sys/kern/vfs_subr.c:2465 #4 0xffffffff80b4da2b in vm_page_alloc (object=0xffffffff8132c000, pindex=143696, req=32) at /usr/src/sys/vm/vm_page.c:1569 #5 0xffffffff80b3f312 in kmem_back (map=0xfffffe00020000e8, addr=18446743524542296064, size=131072, flags=705200752) at /usr/src/sys/vm/vm_kern.c:361 #6 0xffffffff80b3fc8b in kmem_malloc (map=0xfffffe00020000e8, size=131072, flags=2) at /usr/src/sys/vm/vm_kern.c:312 #7 0xffffffff80b3685a in uma_large_malloc (size=131072, wait=2) at /usr/src/sys/vm/uma_core.c:3068 #8 0xffffffff808d0539 in malloc (size=131072, mtp=0xffffffff817f4ce0, flags=2) at /usr/src/sys/kern/kern_malloc.c:492 #9 0xffffffff816696e2 in zio_write_bp_init (zio=0xfffffe016b70a000) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1060 #10 0xffffffff81668e23 in zio_execute (zio=0xfffffe016b70a000) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1256 #11 0xffffffff80928474 in taskqueue_run_locked (queue=0xfffffe0010484280) at /usr/src/sys/kern/subr_taskqueue.c:312 #12 0xffffffff80929426 in taskqueue_thread_loop (arg=) at /usr/src/sys/kern/subr_taskqueue.c:501 #13 0xffffffff808b67af in fork_exit (callout=0xffffffff809293e0 , arg=0xfffffe00103869d0, frame=0xffffff823df70b00) at /usr/src/sys/kern/kern_fork.c:988 #14 0xffffffff80c4ddee in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:602 #15 0x0000000000000000 in ?? 
() (kgdb) frame 4 #4 0xffffffff80b4da2b in vm_page_alloc (object=0xffffffff8132c000, pindex=143696, req=32) at /usr/src/sys/vm/vm_page.c:1569 1569 vdrop(vp); (kgdb) p *vp $3 = {v_type = VREG, v_tag = 0xffffffff816f7842 "zfs", v_op = 0xffffffff816ff7a0, v_data = 0xfffffe00784e42e0, v_mount = 0xfffffe0010890000, v_nmntvnodes = {tqe_next = 0xfffffe00a95281f8, tqe_prev = 0xfffffe0091d09220}, v_un = { vu_mount = 0x0, vu_socket = 0x0, vu_cdev = 0x0, vu_fifoinfo = 0x0}, v_hashlist = {le_next = 0x0, le_prev = 0x0}, v_hash = 19896209, v_cache_src = {lh_first = 0x0}, v_cache_dst = {tqh_first = 0x0, tqh_last = 0xfffffe012f979258}, v_cache_dd = 0x0, v_cstart = 0, v_lasta = 0, v_lastw = 0, v_clen = 0, v_lock = {lock_object = { lo_name = 0xffffffff816f7842 "zfs", lo_flags = 91947008, lo_data = 0, lo_witness = 0x0}, lk_lock = 1, lk_exslpfail = 0, lk_timo = 51, lk_pri = 96}, v_interlock = {lock_object = {lo_name = 0xffffffff80ec2790 "vnode interlock", lo_flags = 16973824, lo_data = 0, lo_witness = 0x0}, mtx_lock = 18446741874964127744}, v_vnlock = 0xfffffe012f979290, v_holdcnt = 0, v_usecount = 0, v_iflag = 256, v_vflag = 0, v_writecount = 0, v_actfreelist = {tqe_next = 0xfffffe00a95281f8, tqe_prev = 0xfffffe0091d09308}, v_bufobj = {bo_mtx = {lock_object = {lo_name = 0xffffffff80ec27a0 "bufobj interlock", lo_flags = 16973824, lo_data = 0, lo_witness = 0x0}, mtx_lock = 4}, bo_clean = {bv_hd = {tqh_first = 0x0, tqh_last = 0xfffffe012f979338}, bv_root = 0x0, bv_cnt = 0}, bo_dirty = {bv_hd = {tqh_first = 0x0, tqh_last = 0xfffffe012f979358}, bv_root = 0x0, bv_cnt = 0}, bo_numoutput = 0, bo_flag = 0, bo_ops = 0xffffffff81253920, bo_bsize = 131072, bo_object = 0xfffffe0070ba5910, bo_synclist = {le_next = 0x0, le_prev = 0x0}, bo_private = 0xfffffe012f9791f8, __bo_vnode = 0xfffffe012f9791f8}, v_pollinfo = 0x0, v_label = 0x0, v_lockf = 0x0, v_rl = { rl_waiters = {tqh_first = 0x0, tqh_last = 0xfffffe012f9793d8}, rl_currdep = 0x0}} (kgdb) p *object $6 = {mtx = {lock_object = {lo_name = 
0xffffffff80ee61ad "vm object", lo_flags = 21168128, lo_data = 0, lo_witness = 0x0}, mtx_lock = 18446741874964127744}, object_list = {tqe_next = 0xffffffff8132bcc0, tqe_prev = 0xffffffff8132bf20}, shadow_head = { lh_first = 0x0}, shadow_list = {le_next = 0x0, le_prev = 0x0}, memq = {tqh_first = 0xfffffe021eebb880, tqh_last = 0xfffffe022a0882f8}, root = 0xfffffe022a0882e8, size = 134217727, generation = 1, ref_count = 2659, shadow_count = 0, memattr = 6 '\006', type = 4 '\004', flags = 4096, pg_color = 0, pad1 = 0, resident_page_count = 124507, backing_object = 0x0, backing_object_offset = 0, pager_object_list = {tqe_next = 0x0, tqe_prev = 0x0}, rvq = { lh_first = 0xfffffe021df8a5c0}, cache = 0x0, handle = 0x0, un_pager = {vnp = {vnp_size = 0, writemappings = 0}, devp = { devp_pglist = {tqh_first = 0x0, tqh_last = 0x0}, ops = 0x0}, sgp = {sgp_pglist = {tqh_first = 0x0, tqh_last = 0x0}}, swp = { swp_bcount = 0}}, cred = 0x0, charge = 0, paging_in_progress = 0} Cheers, Lawrence From owner-freebsd-fs@FreeBSD.ORG Tue Mar 12 06:53:39 2013 Return-Path: Delivered-To: freebsd-fs@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 42EF6162; Tue, 12 Mar 2013 06:53:39 +0000 (UTC) (envelope-from linimon@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) by mx1.freebsd.org (Postfix) with ESMTP id 1D38CB5C; Tue, 12 Mar 2013 06:53:39 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.6/8.14.6) with ESMTP id r2C6rc7U068066; Tue, 12 Mar 2013 06:53:38 GMT (envelope-from linimon@freefall.freebsd.org) Received: (from linimon@localhost) by freefall.freebsd.org (8.14.6/8.14.6/Submit) id r2C6rcWH068065; Tue, 12 Mar 2013 06:53:38 GMT (envelope-from linimon) Date: Tue, 12 Mar 2013 06:53:38 GMT Message-Id: <201303120653.r2C6rcWH068065@freefall.freebsd.org> To: linimon@FreeBSD.org, 
freebsd-bugs@FreeBSD.org, freebsd-fs@FreeBSD.org From: linimon@FreeBSD.org Subject: Re: kern/176857: [softupdates] [panic] 9.1-RELEASE/amd64/GENERIC panic in softdepflush/remove_from_journal X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 12 Mar 2013 06:53:39 -0000 Old Synopsis: [panic] [suj] 9.1-RELEASE/amd64/GENERIC panic in softdepflush/remove_from_journal New Synopsis: [softupdates] [panic] 9.1-RELEASE/amd64/GENERIC panic in softdepflush/remove_from_journal Responsible-Changed-From-To: freebsd-bugs->freebsd-fs Responsible-Changed-By: linimon Responsible-Changed-When: Tue Mar 12 06:52:30 UTC 2013 Responsible-Changed-Why: Change the tag for consistency, and assign. http://www.freebsd.org/cgi/query-pr.cgi?pr=176857 From owner-freebsd-fs@FreeBSD.ORG Tue Mar 12 09:33:47 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 92174901 for ; Tue, 12 Mar 2013 09:33:47 +0000 (UTC) (envelope-from peter.maloney@brockmann-consult.de) Received: from moutng.kundenserver.de (moutng.kundenserver.de [212.227.17.10]) by mx1.freebsd.org (Postfix) with ESMTP id 13D4A677 for ; Tue, 12 Mar 2013 09:33:47 +0000 (UTC) Received: from [10.3.0.26] ([141.4.215.32]) by mrelayeu.kundenserver.de (node=mreu4) with ESMTP (Nemesis) id 0M2CHo-1V4pkU08m4-00s8Es; Tue, 12 Mar 2013 10:33:46 +0100 Message-ID: <513EF679.2080402@brockmann-consult.de> Date: Tue, 12 Mar 2013 10:33:45 +0100 From: Peter Maloney User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: Cody Ritts Subject: Re: Aligning MBR for ZFS boot help References: <513C1629.50501@caltel.com> In-Reply-To: <513C1629.50501@caltel.com> X-Enigmail-Version: 1.5 Content-Type: text/plain; charset=ISO-8859-1 
Content-Transfer-Encoding: 7bit
X-Provags-ID: V02:K0:kmknN+RWGhdQeaB54jmzQVrRkNXnDp2QGvsdMJSOmxZ 8jysFi1+xfsUx1Qmm2trYvsI9jnIjEny0Bd1z5RPXB5mfNS75J vGOqJ7jmvy71ypPR6Cp8Cjx1ucLYrG24uGX9WNv4WWxAJcrXU3 KI8bx6E2Pd+nFsdefHRdLSwt+PLi7aYDWKIjSV2I32kFjVWWu3 l41lWySaPKEt4jSYuAnrz7WNTxDz5xb8SqFKMP2H7rhfn9Wshe fTbv1ce5nWkGppmoU5JLqQwuODdAw6haly5nj54irQNr7hUv/E zNK7lrndHYu3pUltGz3R7vAbIPyn+GfkJ37E+aTpEEkX1LPza3 YipuMvgCLGj8ESTF7fJnwCxs1DiKqST5U3VED5Vo7
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 12 Mar 2013 09:33:47 -0000

On 2013-03-10 06:12, Cody Ritts wrote:
> I think remember reading that freebsd-zfs had to be the first slice (I
> cannot remember where i read that). And it apparently does not think
> an offset is funny.

For the gptzfsboot boot loader (I don't know about others), the bootable ZFS slice has to be in the same pool as the first ZFS slice found. In other words, the first ZFS slice found by the boot loader must be a cache, log, or data vdev of the bootable pool. I learned this the hard way ;)
www.freebsd.org/cgi/query-pr.cgi?pr=160706

e.g. this would fail:
slice 1 - freebsd-boot
slice 2 - zfs /tank
slice 3 - zfs root with /boot

this would also fail:
slice 1 - freebsd-boot
slice 2 - zfs /tank L2ARC cache
slice 3 - zfs root with /boot

this would probably work (fits the rule, but I didn't test it):
slice 1 - freebsd-boot
slice 2 - zfs root L2ARC cache
slice 3 - zfs /tank
slice 4 - zfs root with /boot

and this will definitely work:
slice 1 - freebsd-boot
slice 2 - zfs root with /boot
slice 3 - zfs /tank

The above examples are with the gptzfsboot loader; I'm not sure whether you need a freebsd-boot slice for others.

And FYI, on possibly all BIOS (non-EFI) machines, the boot slice probably needs to be before a certain sector... not sure which number, maybe 2.2TB/2.0TiB. I always just put it first.
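
The slice-ordering rule above can be written down as a small check. This is only a sketch of the rule as stated in this message, not of what gptzfsboot actually does internally; the tuple layout, the pool names and the function name are invented for illustration.

```python
def first_zfs_slice_in_boot_pool(slices, boot_pool):
    """Return True if the first ZFS slice on the disk belongs to boot_pool.

    slices is an ordered list of (slice_type, pool, role) tuples;
    pool and role are None for non-ZFS slices such as freebsd-boot.
    """
    for slice_type, pool, _role in slices:
        if slice_type == "zfs":
            # Only the first ZFS slice matters, whatever its vdev role is.
            return pool == boot_pool
    return False  # no ZFS slice found at all


BOOT = ("freebsd-boot", None, None)

# The four layouts from the message, in the same order.
layouts = [
    ("tank first", [BOOT, ("zfs", "tank", "data"), ("zfs", "root", "data")]),
    ("tank L2ARC first", [BOOT, ("zfs", "tank", "cache"), ("zfs", "root", "data")]),
    ("root L2ARC first", [BOOT, ("zfs", "root", "cache"),
                          ("zfs", "tank", "data"), ("zfs", "root", "data")]),
    ("root first", [BOOT, ("zfs", "root", "data"), ("zfs", "tank", "data")]),
]

results = {name: first_zfs_slice_in_boot_pool(slices, "root")
           for name, slices in layouts}
for name, ok in results.items():
    print(name, "fits the rule" if ok else "fails")
```

The third layout only "fits the rule" here; as noted above, it was not tested on real hardware.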
On my FreeBSD zfs machines, I put it at sector 34 which is badly aligned but means the next regular one can start at 2048, which saves me a whole megabyte! -- -------------------------------------------- Peter Maloney Brockmann Consult Max-Planck-Str. 2 21502 Geesthacht Germany Tel: +49 4152 889 300 Fax: +49 4152 889 333 E-mail: peter.maloney@brockmann-consult.de Internet: http://www.brockmann-consult.de -------------------------------------------- From owner-freebsd-fs@FreeBSD.ORG Tue Mar 12 10:25:40 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 205BF58C for ; Tue, 12 Mar 2013 10:25:40 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail02.syd.optusnet.com.au (mail02.syd.optusnet.com.au [211.29.132.183]) by mx1.freebsd.org (Postfix) with ESMTP id 90BE68FA for ; Tue, 12 Mar 2013 10:25:39 +0000 (UTC) Received: from c211-30-173-106.carlnfd1.nsw.optusnet.com.au (c211-30-173-106.carlnfd1.nsw.optusnet.com.au [211.30.173.106]) by mail02.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id r2CAPPvi019168 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 12 Mar 2013 21:25:27 +1100 Date: Tue, 12 Mar 2013 21:25:25 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Cody Ritts Subject: Re: Aligning MBR for ZFS boot help In-Reply-To: <513E1208.5020804@caltel.com> Message-ID: <20130312203745.A1130@besplex.bde.org> References: <513C1629.50501@caltel.com> <513CD9AB.5080903@caltel.com> <513CE369.4030303@caltel.com> <1362951595.99445.2.camel@btw.pki2.com> <513E1208.5020804@caltel.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.0 cv=bNdOu4CZ c=1 sm=1 a=u3bVZBOdoLwA:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=cUKNXEIY390A:10 a=6I5d2MoRAAAA:8 a=d1Asgjw0WGCMbu39xngA:9 a=CjuIK1q_8ugA:10 a=3JYNrmlC3cAA:10 
a=ApFyF_lCYB5S5Om7:21 a=2tyWdEw6j5jF6zq2:21 a=TEtd8y5WR3g2ypngnwZWYw==:117 Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 12 Mar 2013 10:25:40 -0000 On Mon, 11 Mar 2013, Cody Ritts wrote: > Update -- > > fdisk WILL allow you to align without regards to drive geometry > > It can only be done in interactive mode: > http://lists.freebsd.org/pipermail/freebsd-geom/2011-May/004780.html It can be set in all modes. At least according to the man page: @ CONFIGURATION FILE @ When the -f option is given, a disk's slice table can be written using @ values from a configfile. The syntax of this file is very simple; each @ line is either a comment or a specification, as follows: @ @ # comment ... @ Lines beginning with a # are comments and are ignored. @ @ g spec1 spec2 spec3 @ Set the BIOS geometry used in slice calculations. There must be @ three values specified, with a letter preceding each number: @ @ cnum Set the number of cylinders to num. @ @ hnum Set the number of heads to num. @ @ snum Set the number of sectors/track to num. @ @ These specs can occur in any order, as the leading letter deter- @ mines which value is which; however, all three must be specified. @ @ This line must occur before any lines that specify slice informa- @ tion. @ @ It is an error if the following is not true: @ @ 1 <= number of cylinders @ 1 <= number of heads <= 256 Using 256 risks stepping on BIOS bugs or bugs in other OS's. The default for all large disks is 255. But this should only be used if you aren't trying to align things to a power of 2 boundary, since it is not a power of 2, so using it makes the calculations more complicated and/or requires skipping something like 8 full fake cylinders of size 63*255 sectors each to reach a fake cylinder starting on a 4K boundary. (This assumes a sector size of 512.) 
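
The "8 full fake cylinders" figure above can be checked numerically. A rough sketch, assuming 512-byte sectors and a 4K (8-sector) alignment target; the helper name is made up, and a power-of-two geometry of 64 heads and 32 sectors is included only for comparison.

```python
from math import gcd

SECTOR_BYTES = 512                      # assumed sector size
ALIGN_SECTORS = 4096 // SECTOR_BYTES    # a 4K boundary is 8 sectors


def first_aligned_cylinder(heads, sectors_per_track, align_sectors=ALIGN_SECTORS):
    """Lowest non-zero cylinder number whose first sector is aligned."""
    cylinder_sectors = heads * sectors_per_track
    # Smallest k > 0 such that k * cylinder_sectors is a multiple of align_sectors.
    return align_sectors // gcd(cylinder_sectors, align_sectors)


print(255 * 63)                          # 16065 sectors per fake cylinder
print(first_aligned_cylinder(255, 63))   # 8
print(first_aligned_cylinder(64, 32))    # 1
print(64 * 32 * SECTOR_BYTES)            # 1048576 bytes per fake cylinder
```

With 255 heads the cylinder size is odd, so only every 8th cylinder boundary lands on a 4K boundary; with a power-of-two geometry every cylinder boundary does.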
The old SCSI default should be used. IIRC, it is 32 sectors and 64 heads. 32 is the largest power of 2 less than the limit of 63, and 64 is a large though not maximal power of 2 less than the limit of 256. This makes the fake cylinder size 1 MB. @ 1 <= number of sectors/track < 64 @ @ The number of cylinders should be less than or equal to 1024, but Nah, the number of cylinders shouldn't be less than or equal to 1024, since if that is all it is then the maximum disk size with 512-byte sectors is 63*256*1024*512 = 7.87 GB = 8.46 disk maufacturers GB. You can't buy a new hard disk that small. However, fdisk only uses the number of cylinders for initializing defaults for partition sizes. It can be set to almost any garbage value if you don't use the defaults. @ this is not enforced, although a warning will be printed. Note Indeed, it still prints bogus warnings. @ that bootable FreeBSD slices (the ``/'' file system) must lie @ completely within the first 1024 cylinders; if this is not true, @ booting may fail. Non-bootable slices do not have this restric- @ tion. Note that this is not true, except it only says "may fail". Booting from (fake) cylinders above 1024 was implemented in the FreeBSD boot loader on 26 June 2000. @ @ Example (all of these are equivalent), for a disk with 1019 @ cylinders, 39 heads, and 63 sectors: @ @ g c1019 h39 s63 @ g h39 c1019 s63 @ g s63 h39 c1019 >> fdisk -i /dev/ada0 >> Do you want to change our idea of what BIOS thinks ? [n] Note that fdisk has no idea what the BIOS thinks. The numbers here are what FreeBSD thinks. FreeBSD used to try to determine what the BIOS thinks, but this was broken by GEOM. GEOM just uses whatever the disk says is its "firmware" geometry. But the ATA standard specifies that for disks larger than the magic 8.46 GB number mentioned above, that the fake geometry is always 63 sectors only 16 heads. Thus: - the default geometry in fdisk and presumably in gpart is wrong if the BIOS doesn't use it. 
Some BIOSes default to 240 heads. Some BIOSes allow you to choose between several fake geometries. Some BIOSes allow you to specify the precise geometry. - with only 16 heads, the 1024-cylinder limit is reached at a disk size of only 528 disk manufacturers GB, so fdisk's bogus warnings about this occur for disks less that 20 years old instead of only for disks less than 14 years old. Last time I looked, Linux fdisk[s] worked better on FreeBSD than FreeBSD fdisk, partly because they don't depend on special ioctls, so that they know that they don't know the BIOS geometry. Specifying the geometry is so routine that it is a command-line parameter in all Linux fdisks in FreeBSD ports (at least in old versions). Bruce From owner-freebsd-fs@FreeBSD.ORG Tue Mar 12 10:49:36 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 7BC79971 for ; Tue, 12 Mar 2013 10:49:36 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au [211.29.132.184]) by mx1.freebsd.org (Postfix) with ESMTP id 0B058AEF for ; Tue, 12 Mar 2013 10:49:35 +0000 (UTC) Received: from c211-30-173-106.carlnfd1.nsw.optusnet.com.au (c211-30-173-106.carlnfd1.nsw.optusnet.com.au [211.30.173.106]) by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id r2CAnMZW022890 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 12 Mar 2013 21:49:24 +1100 Date: Tue, 12 Mar 2013 21:49:22 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Bruce Evans Subject: Re: Aligning MBR for ZFS boot help In-Reply-To: <20130312203745.A1130@besplex.bde.org> Message-ID: <20130312213522.R1412@besplex.bde.org> References: <513C1629.50501@caltel.com> <513CD9AB.5080903@caltel.com> <513CE369.4030303@caltel.com> <1362951595.99445.2.camel@btw.pki2.com> <513E1208.5020804@caltel.com> <20130312203745.A1130@besplex.bde.org> 
MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.0 cv=DdhPMYRW c=1 sm=1 a=u3bVZBOdoLwA:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=cUKNXEIY390A:10 a=GRtgF4SnNIrABPWBEScA:9 a=CjuIK1q_8ugA:10 a=TEtd8y5WR3g2ypngnwZWYw==:117 Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 12 Mar 2013 10:49:36 -0000 On Tue, 12 Mar 2013, Bruce Evans wrote: > Last time I looked, Linux fdisk[s] worked better on FreeBSD than FreeBSD > fdisk, partly because they don't depend on special ioctls, so that they > know that they don't know the BIOS geometry. Specifying the geometry is > so routine that it is a command-line parameter in all Linux fdisks in > FreeBSD ports (at least in old versions). Just tried an old version of them on my version of an old version of FreeBSD. They worked not so well: - fdisk-linux: worked OK except for bogus warnings about > 1024 cylinders and a not so bogus warning about a slice not ending on a cylinder boundary. The slice just ends at the end of the disk, and since the cylinders are fake that happens not to be a cylinder boundary. The disk has a firmware fake geometry of 63 sectors 16 heads and mumble cylinders, and the disk manufacturer throws away sectors at the end to make it end on a cylinder boundary with these fake cylinders, but I use different fake cylinders. FreeBSD, Linux and WinXP don't care about the slice not ending on a cylinder boundary. - sfdisk-linux: refused to start without write permission - cfdisk-linux: refused to start due to 1 slice not ending on a cylinder boundary. With its -z workaround for not starting, it starts but is useless since it doesn't display the existing partitions. 
Bruce From owner-freebsd-fs@FreeBSD.ORG Tue Mar 12 20:24:49 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 7384C94 for ; Tue, 12 Mar 2013 20:24:49 +0000 (UTC) (envelope-from cr@caltel.com) Received: from mail2.caltel.com (mail2.caltel.com [66.102.145.6]) by mx1.freebsd.org (Postfix) with ESMTP id 52D2C62C for ; Tue, 12 Mar 2013 20:24:49 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ap8EAF2OP1FCZpCq/2dsb2JhbABDxGqBYXSCKQEBAQMBAQI1RgsLGAklDwIXLxMIAQGICgYMsVSPco8UFoMqA4hziyWCPoEfhEmLDoMqHA X-IPAS-Result: Ap8EAF2OP1FCZpCq/2dsb2JhbABDxGqBYXSCKQEBAQMBAQI1RgsLGAklDwIXLxMIAQGICgYMsVSPco8UFoMqA4hziyWCPoEfhEmLDoMqHA X-IronPort-AV: E=Sophos;i="4.84,833,1355126400"; d="scan'208";a="908682" Received: from host-170.a66-102-144.caltel.com (HELO codys-mac.local) ([66.102.144.170]) by smtp.caltel.com with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA; 12 Mar 2013 13:24:37 -0700 Message-ID: <513F8F04.60206@caltel.com> Date: Tue, 12 Mar 2013 13:24:36 -0700 From: Cody Ritts Organization: CalTel User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:17.0) Gecko/20130307 Thunderbird/17.0.4 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Re: Aligning MBR for ZFS boot help References: <513C1629.50501@caltel.com> <513CD9AB.5080903@caltel.com> <513CE369.4030303@caltel.com> <1362951595.99445.2.camel@btw.pki2.com> <513E1208.5020804@caltel.com> <20130312203745.A1130@besplex.bde.org> In-Reply-To: <20130312203745.A1130@besplex.bde.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 12 Mar 2013 20:24:49 -0000 On 3/12/13 3:25 AM, Bruce Evans wrote: >> Update -- >> >> fdisk WILL allow you to align without regards to 
drive geometry
>>
>> It can only be done in interactive mode:
>> http://lists.freebsd.org/pipermail/freebsd-geom/2011-May/004780.html
>
> It can be set in all modes. At least according to the man page:

In interactive mode, you can simply set the start and size of your partition, be done with it, and boot away.

If you then export that config file and re-run it, you are back to being aligned to CHS.

I would imagine that adjusting your CHS ~correctly~ in the config file will allow you to do it, but I have not found myself motivated to really learn about adjusting those values. I will have to pick that up someday, I suppose.

For informational purposes, here is:
1) partition w/ offset
2) show results
3) export/import config
4) show results with adjusted offset

> root@:/root # fdisk -i ada0
> Do you want to change our idea of what BIOS thinks ? [n]
> Do you want to change it? [n] y
> Supply a decimal value for "sysid (165=FreeBSD)" [165]
> Supply a decimal value for "start" [63] 4096
> Supply a decimal value for "size" [125045361] 125041328
> Correct this automatically? [n]
> Explicitly specify beg/end address ? [n]
> Are we happy with this entry? [n] y
> Do you want to change it? [n]
> Do you want to change it? [n]
> Do you want to change it? [n]
> Do you want to change the active partition? [n]
> Should we write new partition table?
[n] y > > root@:/root # gpart show ada0 > => 63 125045361 ada0 MBR (59G) > 63 4033 - free - (2M) > 4096 125041328 1 freebsd [active] (59G) > > root@:/root # fdisk -p ada0 > # /dev/ada0 > g c124053 h16 s63 > p 1 0xa5 4096 125041328 > a 1 > > root@:/root # fdisk -p ada0 > command > > root@:/root # fdisk -f command ada0 > ******* Working on device /dev/ada0 ******* > fdisk: WARNING line 2: number of cylinders (124053) may be out-of-range > (must be within 1-1024 for normal BIOS operation, unless the entire disk > is dedicated to FreeBSD) > fdisk: WARNING: adjusting start offset of partition 1 > from 4096 to 4158, to fall on a head boundary > fdisk: WARNING: adjusting size of partition 1 from 125041328 to 125041266 > to end on a cylinder boundary > > root@:/root # fdisk -p ada0 > # /dev/ada0 > g c124053 h16 s63 > p 1 0xa5 4158 125041266 > a 1 > > root@:/root # gpart show ada0 > => 63 125045361 ada0 MBR (59G) > 63 4095 - free - (2M) > 4158 125041266 1 freebsd [active] (59G) Thanks, Cody From owner-freebsd-fs@FreeBSD.ORG Wed Mar 13 10:30:27 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id B5D2D691 for ; Wed, 13 Mar 2013 10:30:27 +0000 (UTC) (envelope-from girgen@FreeBSD.org) Received: from melon.pingpong.net (melon.pingpong.net [79.136.116.200]) by mx1.freebsd.org (Postfix) with ESMTP id 5A448622 for ; Wed, 13 Mar 2013 10:30:27 +0000 (UTC) Received: from girgBook.local (c-2754e155.1525-1-64736c12.cust.bredbandsbolaget.se [85.225.84.39]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by melon.pingpong.net (Postfix) with ESMTPSA id 4ED341431A for ; Wed, 13 Mar 2013 11:23:13 +0100 (CET) Message-ID: <51405391.1020006@FreeBSD.org> Date: Wed, 13 Mar 2013 11:23:13 +0100 From: Palle Girgensohn User-Agent: Postbox 3.0.7 (Macintosh/20130119) MIME-Version: 1.0 To: freebsd-fs@FreeBSD.org Subject: 
leaking lots of unreferenced inodes (pg_xlog files?), maybe after moving tables and indexes to tablespace on different volume References: <513FCA39.7030709@FreeBSD.org> In-Reply-To: <513FCA39.7030709@FreeBSD.org> X-Enigmail-Version: 1.2.3 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 13 Mar 2013 10:30:27 -0000

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi!

Running postgresql-9.2.2 on FreeBSD 9.1 amd64 using vanilla ufs file system. I have the postgresql base/ on the /usr disk, and a separate volume /opt where the default tablespace resides. This means that the amount of data on the /usr disk should be stable. This is not the case: the disk usage grows linearly (it seems to leave many inodes unreferenced). The discrepancy between df and du is now huge:

# du -sxh /usr; df -h /usr
4,6G /usr
Filesystem Size Used Avail Capacity Mounted on
/dev/da0s1f 104G 88G 8.0G 92% /usr

4,6G vs 88GB, that must be more than a rounding error?

Strange thing is I cannot find any open files among the missing.

# lsof /usr| awk '{print $9}'|xargs ls -l > /dev/null

returns no errors (a missing file would render an error with ls). If there were open files not referenced in any directory, they should be found.

Next thing is fsck, and yes, there are plenty of unreferenced files. I ran fsck while the system was running (i.e. read only) to get a grip on the amount of lost inodes:

fsck /usr | awk '{print $1}'|cut -f 2 -d=| perl -e '$i = 0; while (<>) { $i += $_;}; print $i / 1024 / 1024; print "\n";'
85223.3530330658

~85 GB gone, that's 80% of the disk, and it accounts for all the missing space.

MTIME for the inodes are pretty evenly spread over time since the machine was updated to FreeBSD 9.1, rebooted, and PostgreSQL was updated to 9.2. All was done at the same time, so I can't really tell who's to blame, but this is the only server, out of a dozen that were updated to exactly the same versions, that has this problem. All other servers have their /usr disk usage stable (since all data resides on a separate tablespace).

The unreferenced inodes are almost exclusively around 16 MB in size, so they most certainly are all postgresql pg_xlog files. This means all files are lost from the same portion of code in the database engine. How could it possibly be able to leave unreferenced inodes around like this at such a scale? Is the culprit a combination of postgresql and file system code? Both were updated.

pg_xlog checkpoints seem to happen approximately every three minutes:

Mar 13 00:39:08 dbserver postgres[5298]: [48-1] db=,user= LOG: checkpoint starting: time
Mar 13 00:41:38 dbserver postgres[5298]: [49-1] db=,user= LOG: checkpoint complete: wrote 2542 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=149.667 s, sync=0.101 s, total=149.770 s; sync files=628, longest=0.021 s, average=0.000 s
Mar 13 00:44:08 dbserver postgres[5298]: [50-1] db=,user= LOG: checkpoint starting: time
Mar 13 00:46:38 dbserver postgres[5298]: [51-1] db=,user= LOG: checkpoint complete: wrote 3996 buffers (0.4%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=149.438 s, sync=0.111 s, total=149.551 s; sync files=823, longest=0.006 s, average=0.000 s
Mar 13 00:49:08 dbserver postgres[5298]: [52-1] db=,user= LOG: checkpoint starting: time
Mar 13 00:51:38 dbserver postgres[5298]: [53-1] db=,user= LOG: checkpoint complete: wrote 13736 buffers (1.4%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=149.958 s, sync=0.311 s, total=150.271 s; sync files=1335, longest=0.079 s, average=0.000 s
Mar 13 00:54:08 dbserver postgres[5298]: [54-1] db=,user= LOG: checkpoint starting: time
Mar 13 00:56:38 dbserver postgres[5298]: [55-1] db=,user= LOG: checkpoint complete: wrote 14638 buffers (1.5%); 0 transaction log file(s) added, 0 removed, 17 recycled; write=149.330 s, sync=0.271 s, total=149.603 s; sync files=1363, longest=0.017 s, average=0.000 s
Mar 13 00:59:08 dbserver postgres[5298]: [56-1] db=,user= LOG: checkpoint starting: time
Mar 13 01:01:38 dbserver postgres[5298]: [57-1] db=,user= LOG: checkpoint complete: wrote 8035 buffers (0.8%); 0 transaction log file(s) added, 0 removed, 21 recycled; write=149.285 s, sync=0.146 s, total=149.433 s; sync files=1160, longest=0.003 s, average=0.000 s
Mar 13 01:04:08 dbserver postgres[5298]: [58-1] db=,user= LOG: checkpoint starting: time
Mar 13 01:06:37 dbserver postgres[5298]: [59-1] db=,user= LOG: checkpoint complete: wrote 2156 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=149.402 s, sync=0.057 s, total=149.461 s; sync files=610, longest=0.000 s, average=0.000 s
Mar 13 01:09:08 dbserver postgres[5298]: [60-1] db=,user= LOG: checkpoint starting: time

I'm pretty certain that unmounting the file system and running fsck will regain the lost space, but will it stop there? Stopping postgresql briefly did not help, I tried that. That would have helped if the files were open, but they're not. It seems that postgresql did the right thing, and FreeBSD failed to unreference the files.

The server has about 30 databases and ~127 concurrent connections (not all being active simultaneously, though), so it is fair to say it is pretty active, but nothing extreme. Hardware is HP DL360, using their HT Smart Array P410i.

Any ideas how to debug this? Or shall I just reboot, fsck, hope the problem will go away, and when it does, forget about it?
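For readers who want to repeat the size tally without the awk/cut/perl pipeline, the same sum can be sketched in Python. This is only a sketch: it assumes, as the pipeline above does, that each unreferenced file is reported with a `SIZE=<bytes>` token, and the function name and the sample text are made up for illustration.

```python
import re


def unreferenced_mib(fsck_output: str) -> float:
    """Sum every SIZE=<bytes> token in the text and return the total in MiB."""
    sizes = (int(m) for m in re.findall(r"\bSIZE=(\d+)", fsck_output))
    return sum(sizes) / 1024 / 1024


# Hypothetical sample text, not real fsck output from the affected server:
# two 16 MB files, the size reported for the suspected pg_xlog segments.
sample = "SIZE=16777216 MTIME=...\nSIZE=16777216 MTIME=...\n"
print(unreferenced_mib(sample))  # 32.0
```

Feeding it the captured fsck output instead of `sample` should give a figure comparable to the 85223 MB quoted above, provided the assumption about the `SIZE=` token holds.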
Thanks, Palle -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.17 (Darwin) Comment: GPGTools - http://gpgtools.org Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQEcBAEBAgAGBQJRQFORAAoJEIhV+7FrxBJDzVUIAJHU011JDxLxj8/xg05Gwhgq XK3xB+0N0NSUQ50yhcRKLINz/j/XfeS0ZxlH+MstaPA9y0r1JUXMxkb/uTUvGBiy jutk3eVe0cati9cVZbJkRU5FxEgmQ0fg0GOMl3RQAErkh5achj+klWvN7PnwGjTs O3L9RgckKuxTJffk52GAS05qY/TKR6f08kdX3I2cFtqw3tyTyrXU0JPdk2snuPhv H40xV46zgtWMFDvZLt61MryQ7/JotVQwU78scUB+zxrf8KKM9V0mM7pk0pIbG4Qw NJBpZJ5gjbl4x+dkQrtZdL65yq88hACYwo9D+83Ct4ig8tgcQ7ViNHWxJqknK7Q= =3ZZs -----END PGP SIGNATURE----- From owner-freebsd-fs@FreeBSD.ORG Wed Mar 13 13:34:15 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 0D2792D7 for ; Wed, 13 Mar 2013 13:34:15 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail12.syd.optusnet.com.au (mail12.syd.optusnet.com.au [211.29.132.193]) by mx1.freebsd.org (Postfix) with ESMTP id 8ADA9338 for ; Wed, 13 Mar 2013 13:34:13 +0000 (UTC) Received: from c211-30-173-106.carlnfd1.nsw.optusnet.com.au (c211-30-173-106.carlnfd1.nsw.optusnet.com.au [211.30.173.106]) by mail12.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id r2DDXxtC032494 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 14 Mar 2013 00:34:01 +1100 Date: Thu, 14 Mar 2013 00:33:59 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Cody Ritts Subject: Re: Aligning MBR for ZFS boot help In-Reply-To: <513F8F04.60206@caltel.com> Message-ID: <20130313232247.B1078@besplex.bde.org> References: <513C1629.50501@caltel.com> <513CD9AB.5080903@caltel.com> <513CE369.4030303@caltel.com> <1362951595.99445.2.camel@btw.pki2.com> <513E1208.5020804@caltel.com> <20130312203745.A1130@besplex.bde.org> <513F8F04.60206@caltel.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: 
v=2.0 cv=Q4OKePKa c=1 sm=1 a=u3bVZBOdoLwA:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=cUKNXEIY390A:10 a=6I5d2MoRAAAA:8 a=ikclS7t6qE9F1bxxU1YA:9 a=CjuIK1q_8ugA:10 a=3JYNrmlC3cAA:10 a=z_RuTD8k6CXfqtkX:21 a=7nNRCSddQ2EWGmUR:21 a=TEtd8y5WR3g2ypngnwZWYw==:117 Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 13 Mar 2013 13:34:15 -0000

On Tue, 12 Mar 2013, Cody Ritts wrote:

> On 3/12/13 3:25 AM, Bruce Evans wrote:
>>> Update --
>>>
>>> fdisk WILL allow you to align without regards to drive geometry
>>>
>>> It can only be done in interactive mode:
>>> http://lists.freebsd.org/pipermail/freebsd-geom/2011-May/004780.html
>>
>> It can be set in all modes. At least according to the man page:
>
> In interactive mode, you can simply set the start and size of your partition,
> be done with it, and boot away.

That will usually give nonsense beginning and ending CHS values. This will usually prevent BIOSes and non-broken versions of FreeBSD from detecting the geometry. It would also prevent booting on BIOSes that use the CHS values; this is not usually a problem now. So this misconfiguration will mainly mess up printing of the partition table in utilities that print the CHS values, cause warnings in utilities that want the CHS values to be correct, and propagate the misconfiguration.

> If you then export that config file, and
> re-run it, you are back to being aligned to CHS.

Only if the config file is broken.

> I would imagine that adjusting your CHS ~correctly~ in the config file will
> allow you to do it, but I have not found myself motivated to really learn
> about adjusting those values. I will have to pick that up someday I suppose.

Yes, this is necessary and very easy. Just type in the same geometry that you need to type in at the start of interactive fdisk to avoid the above problems.

> For informational purposes here is
> 1) partition w/ offset
> 2) show results
> 3) export/import config
> 4) show results with adjusted offset
>
>> root@:/root # fdisk -i ada0
>> Do you want to change our idea of what BIOS thinks ? [n]

You do want to change this. Use something like 32 sectors 64 heads (1MB cylinders).

>> Do you want to change it? [n] y
>> Supply a decimal value for "sysid (165=FreeBSD)" [165]
>> Supply a decimal value for "start" [63] 4096
>> Supply a decimal value for "size" [125045361] 125041328
>> Correct this automatically? [n]

After changing "what BIOS thinks" to something like the above, fdisk shouldn't see anything to correct. The start of 4096 is a multiple of 32*64 = 2048. I just noticed that despite being too chatty, "what BIOS thinks" has bad grammar.

>> Explicitly specify beg/end address ? [n]
>> Are we happy with this entry? [n] y
>> Do you want to change it? [n]
>> Do you want to change it? [n]
>> Do you want to change it? [n]
>> Do you want to change the active partition? [n]
>> Should we write new partition table? [n] y
>>
>> root@:/root # gpart show ada0
>> =>       63  125045361  ada0  MBR  (59G)
>>          63       4033        - free -  (2M)
>>        4096  125041328     1  freebsd  [active]  (59G)

Looks like it got a bogus 63 from the same place as fdisk (from the "firmware" geometry). The free area isn't really 4033 blocks starting at 63, but 4095 blocks starting at 1 (for non-broken BIOSes and OSes). I used to start partitions at offset 1, but now use 63 for portability. However, 63 isn't portable either. The BIOS or another OS might have a different idea of the geometry.

>> root@:/root # fdisk -p ada0
>> # /dev/ada0
>> g c124053 h16 s63
>> p 1 0xa5 4096 125041328
>> a 1

This shows fdisk -p using the same garbage default geometry as interactive fdisk. So fdisk -p output is not directly usable.

There is no place in the partition table to store the geometry directly. It can sometimes be determined indirectly, but neither the kernel nor fdisk does so. Old kernels did so, and fdisk depended on this. However, fdisk shouldn't depend on this, so that fdisk can work on images of partition tables. Linux fdisk (fdisk-linux in ports) does this correctly. Not using OS-specific ioctls for this also improves portability. The kernel support for this was mainly to ensure that all FreeBSD utilities got a consistent view of the geometry, back when consistent views mattered. It's still confusing when the views are different.

>> root@:/root # fdisk -p ada0 > command
>>
>> root@:/root # fdisk -f command ada0
>> ******* Working on device /dev/ada0 *******
>> fdisk: WARNING line 2: number of cylinders (124053) may be out-of-range
>> (must be within 1-1024 for normal BIOS operation, unless the entire
>> disk
>> is dedicated to FreeBSD)
>> fdisk: WARNING: adjusting start offset of partition 1
>> from 4096 to 4158, to fall on a head boundary
>> fdisk: WARNING: adjusting size of partition 1 from 125041328 to 125041266
>> to end on a cylinder boundary

This is broken. The non-interactive version should not adjust anything. Even the interactive version defaults to not adjusting.

>> root@:/root # fdisk -p ada0
>> # /dev/ada0
>> g c124053 h16 s63
>> p 1 0xa5 4158 125041266
>> a 1
>>
>> root@:/root # gpart show ada0
>> =>       63  125045361  ada0  MBR  (59G)
>>          63       4095        - free -  (2M)
>>        4158  125041266     1  freebsd  [active]  (59G)

So the bugs of fdisk -p producing wrong geometry, not editing its output to fix this, and the broken adjustment done by fdisk -f, result in the partition offsets being corrupted by fdisk -f.
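The arithmetic behind the two fdisk -f warnings can be checked with a short Python sketch. It is an illustration only: the rounding rule (start rounded up to a track boundary, end rounded down to a cylinder boundary) is inferred from the warning text quoted above, not taken from the fdisk source, and the function names are hypothetical.

```python
def round_up(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple


def chs_adjust(start: int, size: int, heads: int, sectors: int) -> tuple[int, int]:
    """Assumed adjustment: start goes up to a head (track) boundary and the
    end of the partition goes down to a cylinder boundary."""
    cylinder = heads * sectors
    new_start = round_up(start, sectors)
    new_end = (start + size) // cylinder * cylinder
    return new_start, new_end - new_start


# Geometry from the `g c124053 h16 s63` line: 4096 is not a multiple of 63.
print(chs_adjust(4096, 125041328, heads=16, sectors=63))  # (4158, 125041266)

# With 64 heads and 32 sectors a cylinder is 2048 sectors (1MB),
# so a start of 4096 already sits on a cylinder boundary.
print(4096 % (64 * 32))  # 0
```

The first result reproduces the 4158 and 125041266 that fdisk -f reported, which is consistent with the corruption coming purely from the 16-head, 63-sector geometry in the exported config.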
Bruce From owner-freebsd-fs@FreeBSD.ORG Wed Mar 13 14:12:35 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 4F97ACBB for ; Wed, 13 Mar 2013 14:12:35 +0000 (UTC) (envelope-from darksoul@darkbsd.org) Received: from denrei.darkbsd.org (denrei.darkbsd.org [91.121.179.66]) by mx1.freebsd.org (Postfix) with ESMTP id 01F9D8AB for ; Wed, 13 Mar 2013 14:12:34 +0000 (UTC) Received: from denrei.darkbsd.org (localhost [127.0.0.1]) by denrei.darkbsd.org (Postfix) with ESMTP id E65A4C81 for ; Wed, 13 Mar 2013 15:12:27 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=darkbsd.org; h=message-id :date:from:mime-version:to:subject:references:in-reply-to :content-type; s=selector1; bh=Jol/lfuPfs0++ZatAgIlehTA/wU=; b=q 7mKZoze8PyB2g3wrLbL8JiZFvoh0V4Rle1QQ9qYoxISyrszKBDdjOR/0dbF12fge s46LkEVHpzflLryTeORZueBwcRegVeCe+1C2e4Y8GP/3OHbOPb2LU8KLfHVLZQS4 kvGrhfcuBoFwPd8/kKzrxs3P9cWYo3o2I+mXknMrM8= DomainKey-Signature: a=rsa-sha1; c=nofws; d=darkbsd.org; h=message-id :date:from:mime-version:to:subject:references:in-reply-to :content-type; q=dns; s=selector1; b=HqbJAvYKTRkBCFu/P0aKA2WoY7l 5YpZsO5WIq9xiCeGT9v7vgwjYW35eWg07RXRLmH1UpVyRRzgKvvgUWfjp13shu+b wGhdcZS14ha67K3QoQrqneun6+owtptaTf+9r7ifpQGAhTMDkzAiwgJJobyUmH5F kURYbb2Fai3gYEzs= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=darkbsd.org; h= content-type:content-type:in-reply-to:references:subject:subject :mime-version:user-agent:from:from:date:date:message-id:received :received; s=selector1; t=1363183944; bh=4Nk4GeV3hp9KFCJNPlTRPK/ ZEUAJrPISFLxKtkwNUo0=; b=mBEe01WWLPWEVEUw69EED8gDHckEvwozyswywpA vHpr9KseP0EfZWjDyMprw15VK5aM14xgH7ZtyHxRlVoFYXRTTCcIB0ESBxX/9MHo 9S1CFqP0LOUbqyOpcR5kM9EJuMm4XG98+tSuEEhAe3VEz/bQSFOKExRU5W3/bHrb 1Y3Q= X-Virus-Scanned: amavisd-new at darkbsd.org Received: from denrei.darkbsd.org ([127.0.0.1]) by denrei.darkbsd.org (denrei.darkbsd.org [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id 
36xaddxBmxIn for ; Wed, 13 Mar 2013 15:12:24 +0100 (CET) Received: from [IPv6:2001:470:24:42d::42] (archer.yomi.darkbsd.org [IPv6:2001:470:24:42d::42]) (Authenticated sender: darksoul@darkbsd.org) by denrei.darkbsd.org (Postfix) with ESMTPSA id 566A7C80 for ; Wed, 13 Mar 2013 15:12:22 +0100 (CET) Message-ID: <51408940.9000609@darkbsd.org> Date: Wed, 13 Mar 2013 23:12:16 +0900 From: DarkSoul User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130106 Thunderbird/17.0.2 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Re: Panic loop on ZFS with 9.1-RELEASE References: <513B58B6.2090903@darkbsd.org> <513B6E1E.6080805@darkbsd.org> <513B7555.1010701@darkbsd.org> In-Reply-To: <513B7555.1010701@darkbsd.org> X-Enigmail-Version: 1.4.6 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="------------enig2AE1FCAA1F9E474431CFBC3A" X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 13 Mar 2013 14:12:35 -0000 This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enig2AE1FCAA1F9E474431CFBC3A Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Just a very quick heads up. I still haven't succeeded in importing the pool readwrite, but I have succeeded in importing it readonly. This has been confirmed as a bug by the ZFS illumos ML people. Description : You can't import readonly a pool that has cache devices, because the import will try to send write IOs to auxiliary vdevs, and hit an assert() call, thus provoking a panic. Workaround : Destroy cache devices before zpool import -o readonly=on -f . 
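As a cross-check of the offset conversion in the earlier messages quoted below, here is a small Python sketch that maps the offset from the panic message to a metaslab. It assumes that a metaslab's index is the offset shifted right by `metaslab_shift`, which matches the zdb listing (metaslab 32 at 20000000000, metaslab 33 at 21000000000) but is not taken from the ZFS source; the function name is hypothetical.

```python
def metaslab_index(offset: int, metaslab_shift: int) -> int:
    """Assumed mapping: each metaslab covers 2**metaslab_shift bytes of the vdev."""
    return offset >> metaslab_shift


offset = 2335563722752   # from "allocating allocated segment(offset=... size=1024)"
metaslab_shift = 36      # from the vdev 1 configuration printed by zdb

print(hex(offset))                                      # 0x21fca723000
index = metaslab_index(offset, metaslab_shift)
print(index)                                            # 33
print(hex(index << metaslab_shift))                     # 0x21000000000
```

The offset therefore falls inside metaslab 33, the one whose space map zdb fails on.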
Cheers, On 03/10/2013 02:45 AM, Stephane LAPIE wrote: > Pinpoint analysis of the zpool on the broken vdev gives the following > information : > > # zdb -AAA -e -mm prana 1 33 > > Metaslabs: > vdev 1 > metaslabs 145 offset spacemap free > --------------- ------------------- --------------- ------------- > metaslab 33 offset 21000000000 spacemap 303 free 11.9G > WARNING: zfs: allocating allocated segment(offset=2335563722752 size=1024) > > Assertion failed: sm->sm_space == space (0x2f927f400 == 0x2f927f800), > file > /usr/storage/tech/eirei-no-za.yomi.darkbsd.org/usr/src/cddl/lib/libzpool/../../../sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c, > line 353. > pid 51 (zdb), uid 0: exited on signal 6 (core dumped) > Abort trap (core dumped) > > Just in case, root vdev 1 is made of the following devices : > children[1]: > type: 'raidz' > id: 1 > guid: 1078755695237588414 > nparity: 1 > metaslab_array: 175 > metaslab_shift: 36 > ashift: 9 > asize: 10001970626560 > is_log: 0 > children[0]: > type: 'disk' > id: 0 > guid: 12900041001921590764 > path: '/dev/da10' > phys_path: '/dev/da10' > whole_disk: 0 > DTL: 4127 > children[1]: > type: 'disk' > id: 1 > guid: 7211789756938666186 > path: '/dev/da3' > phys_path: '/dev/da3' > whole_disk: 1 > DTL: 4119 > children[2]: > type: 'disk' > id: 2 > guid: 12094368820342087236 > path: '/dev/da5' > phys_path: '/dev/da5' > whole_disk: 1 > DTL: 212 > children[3]: > type: 'disk' > id: 3 > guid: 6868867539761908697 > path: '/dev/da4' > phys_path: '/dev/da4' > whole_disk: 0 > DTL: 4173 > children[4]: > type: 'disk' > id: 4 > guid: 3091570768700552191 > path: '/dev/da6' > phys_path: '/dev/da6' > whole_disk: 0 > DTL: 4182 > > At this point I am nearly considering ripping these out and zpool > importing while ignoring missing devices... :/ > > On 03/10/2013 02:15 AM, Stephane LAPIE wrote: >> Posting a quick update. 
>> >> I ran a "zdb -emm" command to figure out what was going on, and it blew >> up in my face with an abort trap here : >> - vdev 0 has 145 metaslabs, which are cleared without any problems. >> - vdev 1 has 145 metaslabs, but fails in the middle : >> metaslab 32 offset 20000000000 spacemap 289 free 1.64G >> segments 19509 maxsize 41.7M freepct 2% >> metaslab 33 offset 21000000000 spacemap 303 free 11.9G >> error: zfs: allocating allocated segment(offset=2335563722752 size=1024) >> Abort trap(core dumped) >> >> Converting offset 2335563722752 from earlier kernel panic messages gives >> : 21fca723000, which matches the broken metaslab found by zdb. >> >> Is there anything I can do at this point, using zdb? >> It just sounds surrealistic I have ONE broken metaslab (seemingly?) and >> that I can't recover anything... >> >> Cheers, >> >> On 03/10/2013 12:43 AM, Stephane LAPIE wrote: >>> Hello list, >>> >>> I currently am faced with a sudden death case I can't understand at all, >>> and I would be very appreciating of any explanation or assistance :( >>> >>> Here is my current kernel version : >>> FreeBSD 9.1-STABLE FreeBSD 9.1-STABLE #5 r245055: Thu Jan 17 13:12:59 >>> JST 2013 >>> darksoul@eirei-no-za.yomi.darkbsd.org:/usr/obj/usr/storage/tech/eirei-no-za.yomi.darkbsd.org/usr/src/sys/DARK-2012KERN >>> amd64 >>> (Kernel is basically a lightened GENERIC kernel without VESA options and >>> unneeded controllers removed) >>> >>> The pool is a set of 3x raidz1 (5 drives), + 2 cache devices + mirrored >>> transaction log >>> >>> Booting and trying to import the pool is met with : >>> Solaris(panic): zfs: panic: allocating allocated >>> segment(offset=2335563722752 size=1024) >>> >>> Booting single mode on my emergency flash card with a base OS and zpool >>> import -o readonly=on is met with : >>> panic: solaris assert: zio->io_type != ZIO_TYPE_WRITE || >>> spa_writeable(spa), file: >>> 
/usr/storage/tech/eirei-no-za.yomi.darkbsd.org/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c, >>> line: 2461 >>> >>> I tried zpool import -F -f, zpool import -F -f -m after removing the >>> mirrored transaction log devices, but after 40s of trying to import, it >>> just blows up. >>> >>> I am currently running "zdb -emm" as per the procedure suggested here : >>> http://simplex.swordsaint.net/?p=199 if only to get some debug information. >>> >>> Thanks in advance for your time. >>> >>> Cheers, >>> >>> >>> -- >>> Stephane LAPIE, EPITA SRS, Promo 2005 >>> "Even when they have digital readouts, I can't understand them." >>> --MegaTokyo >> -- >> Stephane LAPIE, EPITA SRS, Promo 2005 >> "Even when they have digital readouts, I can't understand them." >> --MegaTokyo -- Stephane LAPIE, EPITA SRS, Promo 2005 "Even when they have digital readouts, I can't understand them." --MegaTokyo --------------enig2AE1FCAA1F9E474431CFBC3A Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with undefined - http://www.enigmail.net/ iF4EAREIAAYFAlFAiUEACgkQDJ4OK7D3FWQcogEAs/t505xibhP4EWsRGiAF8+qO NDV/kBSdgU7Cd/UB118BALK603KdwiW4fxn/NnGBZa4T0k5NhUWUrwQ/YgjgUZWO =6itv -----END PGP SIGNATURE----- --------------enig2AE1FCAA1F9E474431CFBC3A-- From owner-freebsd-fs@FreeBSD.ORG Wed Mar 13 16:52:34 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 037127ED; Wed, 13 Mar 2013 16:52:34 +0000 (UTC) (envelope-from mckusick@mckusick.com) Received: from chez.mckusick.com (chez.mckusick.com [IPv6:2001:5a8:4:7e72:4a5b:39ff:fe12:452]) by mx1.freebsd.org (Postfix) with ESMTP id A38E874F; Wed, 13 Mar 2013 16:52:33 +0000 (UTC) Received: from chez.mckusick.com (localhost 
[127.0.0.1]) by chez.mckusick.com (8.14.3/8.14.3) with ESMTP id r2DGqSr4051899; Wed, 13 Mar 2013 09:52:29 -0700 (PDT) (envelope-from mckusick@chez.mckusick.com) Message-Id: <201303131652.r2DGqSr4051899@chez.mckusick.com> To: Palle Girgensohn Subject: Re: leaking lots of unreferenced inodes (pg_xlog files?), maybe after moving tables and indexes to tablespace on different volume In-reply-to: <51405391.1020006@FreeBSD.org> Date: Wed, 13 Mar 2013 09:52:28 -0700 From: Kirk McKusick X-Spam-Status: No, score=0.0 required=5.0 tests=MISSING_MID, UNPARSEABLE_RELAY autolearn=failed version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on chez.mckusick.com Cc: freebsd-fs@freebsd.org, Jeff Roberson X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 13 Mar 2013 16:52:34 -0000

Thanks for your report. It is certainly unlike anything that we have seen reported before. Are you running your /usr filesystem with (the default) journalled soft updates? You can check this by running the `mount' command with no arguments.

Rather than rebooting your system, it would be most helpful if you could instead shut it down to single user. Then do the following:

Create a transcript of your session by running `script'. Once running in the session run these commands:

Run `mount' to show your filesystem configuration.

Run `df -hi /usr' to see whether the inodes are still missing.

Verify that you can cleanly unmount /usr (e.g., that the unmount does not hang and does not complain).

Remount /usr and run `df -hi' to see whether the inodes are still missing.

Unmount /usr again and run `fsck_ffs -p -f -d /usr'. If the fsck_ffs fails with an unexpected inconsistency, you can run `fsck_ffs -y -d /usr' to force it to clean up.

When you have the filesystem successfully cleaned up, type `exit' to get out of the script session and mail me the transcript of the session (typescript).

Thanks for your help in tracking this down.

Kirk McKusick

_______________________________________________
freebsd-fs@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-fs
To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"

From owner-freebsd-fs@FreeBSD.ORG Wed Mar 13 19:20:35 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 22B1476F; Wed, 13 Mar 2013 19:20:35 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id AF1F3129; Wed, 13 Mar 2013 19:20:33 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id E70C5B999; Wed, 13 Mar 2013 15:20:32 -0400 (EDT) From: John Baldwin To: fs@freebsd.org Subject: Deadlock in the NFS client Date: Wed, 13 Mar 2013 13:56:37 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p25; KDE/4.5.5; amd64; ; ) MIME-Version: 1.0 Content-Type: Text/Plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Message-Id: <201303131356.37919.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Wed, 13 Mar 2013 15:20:33 -0400 (EDT) Cc: Rick Macklem X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 13 Mar 2013 19:20:35 -0000 I ran into a machine that had a deadlock among certain files on a given NFS mount today. I'm not sure how best to resolve it, though it seems like perhaps there is a bug with how the pool of nfsiod threads is managed. Anyway, more details on the actual hang below. This was on 8.x with the old NFS client, but I don't see anything in HEAD that would fix this. First note that the system was idle so it had dropped down to only one nfsiod thread. The nfsiod thread is hung on a vnode lock: (kgdb) proc 36927 [Switching to thread 150 (Thread 100679)]#0 sched_switch ( td=0xffffff0320de88c0, newtd=0xffffff0003521460, flags=Variable "flags" is not available. ) at /usr/src/sys/kern/sched_ule.c:1898 1898 cpuid = PCPU_GET(cpuid); (kgdb) where #0 sched_switch (td=0xffffff0320de88c0, newtd=0xffffff0003521460, flags=Variable "flags" is not available. ) at /usr/src/sys/kern/sched_ule.c:1898 #1 0xffffffff80407953 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:449 #2 0xffffffff8043e342 in sleepq_wait (wchan=0xffffff0358bbb7f8, pri=96) at /usr/src/sys/kern/subr_sleepqueue.c:629 #3 0xffffffff803e5755 in __lockmgr_args (lk=0xffffff0358bbb7f8, flags=524544, ilk=0xffffff0358bbb820, wmesg=Variable "wmesg" is not available. ) at /usr/src/sys/kern/kern_lock.c:220 #4 0xffffffff80489219 in vop_stdlock (ap=Variable "ap" is not available. 
) at lockmgr.h:94 #5 0xffffffff80697322 in VOP_LOCK1_APV (vop=0xffffffff80892b00, a=0xffffff847ac10600) at vnode_if.c:1988 #6 0xffffffff804a8bb7 in _vn_lock (vp=0xffffff0358bbb760, flags=524288, file=0xffffffff806fa421 "/usr/src/sys/kern/vfs_subr.c", line=2138) at vnode_if.h:859 #7 0xffffffff8049b680 in vget (vp=0xffffff0358bbb760, flags=524544, td=0xffffff0320de88c0) at /usr/src/sys/kern/vfs_subr.c:2138 #8 0xffffffff8048d4aa in vfs_hash_get (mp=0xffffff004a3a0000, hash=227722108, flags=Variable "flags" is not available. ) at /usr/src/sys/kern/vfs_hash.c:81 #9 0xffffffff805631f6 in nfs_nget (mntp=0xffffff004a3a0000, fhp=0xffffff03771eed56, fhsize=32, npp=0xffffff847ac10a40, flags=524288) at /usr/src/sys/nfsclient/nfs_node.c:120 #10 0xffffffff80570229 in nfs_readdirplusrpc (vp=0xffffff0179902760, uiop=0xffffff847ac10ad0, cred=0xffffff005587c300) at /usr/src/sys/nfsclient/nfs_vnops.c:2636 ---Type to continue, or q to quit--- #11 0xffffffff8055f144 in nfs_doio (vp=0xffffff0179902760, bp=0xffffff83e05c5860, cr=0xffffff005587c300, td=Variable "td" is not available. ) at /usr/src/sys/nfsclient/nfs_bio.c:1600 #12 0xffffffff8056770a in nfssvc_iod (instance=Variable "instance" is not available. ) at /usr/src/sys/nfsclient/nfs_nfsiod.c:303 #13 0xffffffff803d0c2f in fork_exit (callout=0xffffffff805674b0 , arg=0xffffffff809266e0, frame=0xffffff847ac10c40) at /usr/src/sys/kern/kern_fork.c:861 Thread stuck in getblk for that vnode (holds shared lock on this vnode): (kgdb) proc 36902 [Switching to thread 149 (Thread 101543)]#0 sched_switch ( td=0xffffff0378d8a8c0, newtd=0xffffff0003521460, flags=Variable "flags" is not available. ) at /usr/src/sys/kern/sched_ule.c:1898 1898 cpuid = PCPU_GET(cpuid); (kgdb) where #0 sched_switch (td=0xffffff0378d8a8c0, newtd=0xffffff0003521460, flags=Variable "flags" is not available. 
) at /usr/src/sys/kern/sched_ule.c:1898 #1 0xffffffff80407953 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:449 #2 0xffffffff8043e342 in sleepq_wait (wchan=0xffffff83e11bc1c0, pri=96) at /usr/src/sys/kern/subr_sleepqueue.c:629 #3 0xffffffff803e5755 in __lockmgr_args (lk=0xffffff83e11bc1c0, flags=530688, ilk=0xffffff0358bbb878, wmesg=Variable "wmesg" is not available. ) at /usr/src/sys/kern/kern_lock.c:220 #4 0xffffffff80483aeb in getblk (vp=0xffffff0358bbb760, blkno=3, size=32768, slpflag=0, slptimeo=0, flags=0) at lockmgr.h:94 #5 0xffffffff8055e963 in nfs_getcacheblk (vp=0xffffff0358bbb760, bn=3, size=32768, td=0xffffff0378d8a8c0) at /usr/src/sys/nfsclient/nfs_bio.c:1259 #6 0xffffffff805627d9 in nfs_bioread (vp=0xffffff0358bbb760, uio=0xffffff847bcf0ad0, ioflag=Variable "ioflag" is not available. ) at /usr/src/sys/nfsclient/nfs_bio.c:530 #7 0xffffffff806956a4 in VOP_READ_APV (vop=0xffffffff808a29a0, a=0xffffff847bcf09c0) at vnode_if.c:887 #8 0xffffffff804a9e27 in vn_read (fp=0xffffff03b6a506e0, uio=0xffffff847bcf0ad0, active_cred=Variable "active_cred" is not available. ) at vnode_if.h:384 #9 0xffffffff80444fb1 in dofileread (td=0xffffff0378d8a8c0, fd=3, fp=0xffffff03b6a506e0, auio=0xffffff847bcf0ad0, offset=Variable "offset" is not available. ) at file.h:242 The buffer is locked by LK_KERNPROC: (kgdb) bprint bp 0xffffff83e11bc128: BIO_READ flags (ASYNC|VMIO) error = 0, bufsize = 32768, bcount = 32768, b_resid = 0 bufobj = 0xffffff0358bbb878, data = 0xffffff8435a05000, blkno = d, dep = 0x0 lock type bufwait: EXCL by LK_KERNPROC with exclusive waiters pending And this buffer is queued as the first pending buffer on the mount waiting for service by nfsiod: (kgdb) set $nmp = (struct nfsmount *)vp->v_mount->mnt_data (kgdb) p $nmp->nm_bufq.tqh_first $24 = (struct buf *) 0xffffff83e11bc128 (kgdb) p bp $25 = (struct buf *) 0xffffff83e11bc128 So, the first process is waiting for a block from an NFS directory. 
That block is in queue to be completed as an async I/O by the nfsiod thread pool. However, the lone nfsiod thread in the pool is waiting to exclusively lock the original NFS directory to update its attributes, so it cannot service the async I/O request. -- John Baldwin From owner-freebsd-fs@FreeBSD.ORG Wed Mar 13 21:45:58 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 703CE795 for ; Wed, 13 Mar 2013 21:45:58 +0000 (UTC) (envelope-from cr@caltel.com) Received: from mail1.caltel.com (mail1.caltel.com [66.102.144.6]) by mx1.freebsd.org (Postfix) with ESMTP id 53895E70 for ; Wed, 13 Mar 2013 21:45:58 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqAEAFzyQFFCZpCq/2dsb2JhbABDxFaBb3SCKgEBAQMBAQI1QAYLCxgJFg8JAwIBAgEWLxMIAQGICgYMwxuNYIE3g0ADiHSLJoI+gR+ES4sYgyoc X-IPAS-Result: AqAEAFzyQFFCZpCq/2dsb2JhbABDxFaBb3SCKgEBAQMBAQI1QAYLCxgJFg8JAwIBAgEWLxMIAQGICgYMwxuNYIE3g0ADiHSLJoI+gR+ES4sYgyoc X-IronPort-AV: E=Sophos;i="4.84,840,1355126400"; d="scan'208";a="17130627" Received: from host-170.a66-102-144.caltel.com (HELO codys-mac.local) ([66.102.144.170]) by smtp.caltel.com with ESMTP/TLS/DHE-RSA-CAMELLIA256-SHA; 13 Mar 2013 14:45:23 -0700 Message-ID: <5140F373.1010907@caltel.com> Date: Wed, 13 Mar 2013 14:45:23 -0700 From: Cody Ritts Organization: CalTel User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:17.0) Gecko/20130307 Thunderbird/17.0.4 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Re: Aligning MBR for ZFS boot help References: <513C1629.50501@caltel.com> <513CD9AB.5080903@caltel.com> <513CE369.4030303@caltel.com> <1362951595.99445.2.camel@btw.pki2.com> <513E1208.5020804@caltel.com> <20130312203745.A1130@besplex.bde.org> <513F8F04.60206@caltel.com> <20130313232247.B1078@besplex.bde.org> In-Reply-To: <20130313232247.B1078@besplex.bde.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed 
Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 13 Mar 2013 21:45:58 -0000

Holy crap, there is so much baggage with CHS, I had no idea we were still dragging it along to this extent.

I really appreciate you emphasizing its importance; I was ready to just blow off the incorrect geometry and be happy with my hacked sector start.

So here is my drive...

> root@:/root # diskinfo -v ada0
> ada0
> 512 # sectorsize
> 64023257088 # mediasize in bytes (59G)
> 125045424 # mediasize in sectors
> 0 # stripesize
> 0 # stripeoffset
> 124053 # Cylinders according to firmware.
> 16 # Heads according to firmware.
> 63 # Sectors according to firmware.

So, if I now want to create an aligned single partition, here are the steps I think I should be taking:

Sectors should be < 64
Heads should be < 256
For OLD OLD stuff, cylinders should be < 1024
If you want boundaries on a power of 2, then the number of sectors and heads should also be a power of 2.

So, would all of these be potential valid values?

s32 h128 512*32*128 = 2097152B = 2MB cylinder
s32 h64 512*32*64 = 1048576B = 1MB cylinder
s16 h128 512*16*128 = 1048576B = 1MB cylinder
s4 h4 512*4*4 = 8192B = 8K cylinder

I am assuming that once I know my cylinder size, I just divide the total size of my hard drive by it to come up with the cylinder count?

s4 h4 64023257088 / 8192 = 7815339c (8K is the largest power of 2 that the drive will evenly divide into)
s32 h64 64023257088 / 1048576 = 61057.3359375
Round down to 61057. (Does the last cylinder need to end at the end of the disk?)
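The arithmetic above can be checked with a short script. This is only a sketch: the function names are made up for illustration, and the only real numbers in it are the sector size and media size from the diskinfo output. The last candidate is the 4-sector, 4-head geometry that gives 8K cylinders.

```python
SECTOR_SIZE = 512            # bytes, from "diskinfo -v ada0"
MEDIA_SIZE = 64023257088     # bytes, from "diskinfo -v ada0"


def cylinder_bytes(heads, sectors, sector_size=SECTOR_SIZE):
    """Size in bytes of one cylinder for a given heads/sectors geometry."""
    return heads * sectors * sector_size


def whole_cylinders(media_bytes, heads, sectors):
    """Number of complete cylinders that fit on the disk (rounded down)."""
    return media_bytes // cylinder_bytes(heads, sectors)


def starts_on_cylinder(start_sector, heads, sectors):
    """True if a partition start (in sectors) falls on a cylinder boundary."""
    return start_sector % (heads * sectors) == 0


if __name__ == "__main__":
    # Candidate geometries as (heads, sectors per track).
    for heads, sectors in [(128, 32), (64, 32), (128, 16), (4, 4)]:
        size = cylinder_bytes(heads, sectors)
        cyls = whole_cylinders(MEDIA_SIZE, heads, sectors)
        left = MEDIA_SIZE - cyls * size
        print(f"h{heads} s{sectors}: {size} B/cyl, "
              f"{cyls} cylinders, {left} B left over")
```

With h64 s32 a start of 4096 sectors is a multiple of 64*32 = 2048, so it sits on a cylinder boundary; with the firmware's h16 s63 (1008 sectors per cylinder) it does not.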
So, here is what i calculated: c61057 h64 s32 I want an offset of 2M, file system should be reduced to 61055M (61055 * 1024 * 1024)/512 = 125040640s) Here are the commands that I ran: > cat << EOF > command > g c61057 h64 s32 > p 1 0xa5 4096 125040640 > a 1 > EOF > root@:/root # fdisk -f command ada0 > ******* Working on device /dev/ada0 ******* > fdisk: WARNING line 1: number of cylinders (61057) may be out-of-range > (must be within 1-1024 for normal BIOS operation, unless the entire disk > is dedicated to FreeBSD) > root@:/root # fdisk -p ada0 > # /dev/ada0 > g c124053 h16 s63 > p 1 0xa5 4096 125040640 > a 1 note, it auto goes back when exporting > root@:/root # gpart show ada0 > => 63 125045361 ada0 MBR (59G) > 63 4033 - free - (2M) > 4096 125040640 1 freebsd [active] (59G) > 125044736 688 - free - (344k) > root@:/root # gpart delete -i 1 ada0 > root@:/root # gpart add -t freebsd -b 4096 -s 125040640 ada0 > ada0s1 added > root@:/root # gpart show ada0 > => 63 125045361 ada0 MBR (59G) > 63 4095 - free - (2M) > 4158 125040573 1 freebsd (59G) > 125044731 693 - free - (346k) gpart does not care > root@:/root # fdisk -f command ada0 > ******* Working on device /dev/ada0 ******* > fdisk: WARNING line 1: number of cylinders (61057) may be out-of-range > (must be within 1-1024 for normal BIOS operation, unless the entire disk > is dedicated to FreeBSD) > root@:/root # fdisk ada0 > ******* Working on device /dev/ada0 ******* > parameters extracted from in-core disklabel are: > cylinders=124053 heads=16 sectors/track=63 (1008 blks/cyl) > > Figures below won't work with BIOS for partitions not in cyl 1 > parameters to be used for BIOS calculations are: > cylinders=124053 heads=16 sectors/track=63 (1008 blks/cyl) > > Media sector size is 512 > Warning: BIOS sector numbering starts with sector 1 > Information from DOS bootblock is: > The data for partition 1 is: > sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD) > start 4096, size 125040640 (61055 Meg), flag 80 (active) > beg: cyl 
2/ head 0/ sector 1;
> end: cyl 640/ head 63/ sector 32
> The data for partition 2 is:
>
> The data for partition 3 is:
>
> The data for partition 4 is:
>

So, setting the geom simply does this:

>> sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
>> start 4096, size 125040640 (61055 Meg), flag 80 (active)
>> beg: cyl 2/ head 0/ sector 1;
>> end: cyl 640/ head 63/ sector 32

I cannot set the geometry in my BIOS, nor does it show me what it thinks the geometry is. Obviously anything that only supports 1024 cylinders will not think it is very funny.

I feel like I am missing some part of this puzzle, or is that all there is to correcting the geometry for proper alignment on an MBR?

So, by setting those CHS values I am:
- making the partition table more compatible with other operating systems and BIOSes?
- giving some utilities the CHS stuff they need to function right?

Thanks,
Cody

On 3/13/13 6:33 AM, Bruce Evans wrote:
> On Tue, 12 Mar 2013, Cody Ritts wrote:
>
>> On 3/12/13 3:25 AM, Bruce Evans wrote:
>>>> Update --
>>>>
>>>> fdisk WILL allow you to align without regards to drive geometry
>>>>
>>>> It can only be done in interactive mode:
>>>> http://lists.freebsd.org/pipermail/freebsd-geom/2011-May/004780.html
>>>
>>> It can be set in all modes. At least according to the man page:
>>
>> In interactive mode, you can simply set the start and size of your
>> partition, be done with it, and boot away.
>
> That will usually give nonsense beginning and ending CHS values. This will
> usually prevent BIOSes and non-broken versions of FreeBSD from detecting
> the geometry. It would also prevent booting on BIOSes that use the CHS
> values; this is not usually a problem now. So this misconfiguration will
> mainly mess up printing of the partition table in utilities that print
> the CHS values, and cause warnings in utilities that want the CHS values
> to be correct, and propagate the misconfiguration.
>
>> If you then export that config file, and re-run it, you are back to
>> being aligned to CHS.
> > Only if the config file is broken. > >> I would imagine that adjusting your CHS ~correctly~ in the config file >> will allow you to to do it, but I have not found myself motivated to >> really learn about adjusting those values. I will have to pick that >> up someday I suppose. > > Yes this is necessary and very easy. Just type in the same geometry that > you need to type in at the start of interactive fdisk to avoid the above > problems. > >> For informational purposes here is >> 1) partition w/ offset >> 2) show results >> 3) export/import config >> 4) show results with adjusted offset >> >>> root@:/root # fdisk -i ada0 >>> Do you want to change our idea of what BIOS thinks ? [n] > > You do want the change this. Use something like 32 sectors 64 heads > (1MB cylinders). > >>> Do you want to change it? [n] y >>> Supply a decimal value for "sysid (165=FreeBSD)" [165] >>> Supply a decimal value for "start" [63] 4096 >>> Supply a decimal value for "size" [125045361] 125041328 >>> Correct this automatically? [n] > > After changing "what BIOS thinks" to something like the above, fdisk > shouldn't see anything to correct. The start of 4096 is a multiple > of 32*64 = 2048. > > I just noticed that despite being too chatty, "what BIOS thinks" > has bad grammar. > >>> Explicitly specify beg/end address ? [n] >>> Are we happy with this entry? [n] y >>> Do you want to change it? [n] >>> Do you want to change it? [n] >>> Do you want to change it? [n] >>> Do you want to change the active partition? [n] >>> Should we write new partition table? [n] y >>> >>> root@:/root # gpart show ada0 >>> => 63 125045361 ada0 MBR (59G) >>> 63 4033 - free - (2M) >>> 4096 125041328 1 freebsd [active] (59G) > > Looks like it got a bogus 63 from the same place as fdisk (from the > "firmware" goemetry. > > The free area isn't really 4033 block sectors at 63, but 4095 blocks > starting at 1 (for non-broken BIOSes and OSes). I used to start > partitions at offset 1, but now use 63 for portability. 
However, 63 > isn't portable either. The BIOS or another OS might have a different > idea of the geometry. > >>> root@:/root # fdisk -p ada0 >>> # /dev/ada0 >>> g c124053 h16 s63 >>> p 1 0xa5 4096 125041328 >>> a 1 > > This shows fdisk -p using the same garbage default geometry as interactive > fdisk. So fdisk -p output is not directly usable. > > There is no place in the partition table to store the geometry directly. > It can sometimes be determined indirectly, but neither the kernel nor > fdisk does so. Old kernels did so, and fdisk depended on this. > However, fdisk shouldn't depend on this, so that fdisk can work on > images of partition tables. Linux fdisk (fdisk-linux in ports) does > this correctly. Not using OS-specific ioctls for this also improves > portability. The kernel support for this was mainly to ensure that > all FreeBSD utilities got a consistent view of the geometry, back when > consistent views mattered. It's still confusing when the views are > different. > >>> root@:/root # fdisk -p ada0 > command >>> >>> root@:/root # fdisk -f command ada0 >>> ******* Working on device /dev/ada0 ******* >>> fdisk: WARNING line 2: number of cylinders (124053) may be out-of-range >>> (must be within 1-1024 for normal BIOS operation, unless the >>> entire disk >>> is dedicated to FreeBSD) >>> fdisk: WARNING: adjusting start offset of partition 1 >>> from 4096 to 4158, to fall on a head boundary >>> fdisk: WARNING: adjusting size of partition 1 from 125041328 to >>> 125041266 >>> to end on a cylinder boundary > > This is broken. The non-interactive version should not adjust anything. > Even the interactive version defaults to not adjusting. 
> >>> root@:/root # fdisk -p ada0 >>> # /dev/ada0 >>> g c124053 h16 s63 >>> p 1 0xa5 4158 125041266 >>> a 1 >>> >>> root@:/root # gpart show ada0 >>> => 63 125045361 ada0 MBR (59G) >>> 63 4095 - free - (2M) >>> 4158 125041266 1 freebsd [active] (59G) > > So the bugs of fdisk -p producing wrong geometry, not editing its output > to fix this, and the broken adjustment done by fdisk -f, result in the > partiton offsets being corrupted by fdisk -f. > > Bruce > From owner-freebsd-fs@FreeBSD.ORG Wed Mar 13 23:33:37 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 64B8B38F; Wed, 13 Mar 2013 23:33:37 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id E59E1DD2; Wed, 13 Mar 2013 23:33:36 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqEEABoMQVGDaFvO/2dsb2JhbAA7CIgkuV6CXYFwdIIqAQEEASMEUgUWDgoCAg0ZAlkGiCEGr26SQxeBI4w4gQE0B4ItgRMDlliRAoMmIIFs X-IronPort-AV: E=Sophos;i="4.84,840,1355115600"; d="scan'208";a="21123471" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 13 Mar 2013 19:33:35 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 6C4EDB402D; Wed, 13 Mar 2013 19:33:35 -0400 (EDT) Date: Wed, 13 Mar 2013 19:33:35 -0400 (EDT) From: Rick Macklem To: John Baldwin Message-ID: <492562517.3880600.1363217615412.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <201303131356.37919.jhb@freebsd.org> Subject: Re: Deadlock in the NFS client MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.203] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692) Cc: Rick Macklem , 
fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 13 Mar 2013 23:33:37 -0000

John Baldwin wrote:
> I ran into a machine that had a deadlock among certain files on a given NFS
> mount today. I'm not sure how best to resolve it, though it seems like
> perhaps there is a bug with how the pool of nfsiod threads is managed.
> Anyway, more details on the actual hang below. This was on 8.x with the
> old NFS client, but I don't see anything in HEAD that would fix this.
>
> First note that the system was idle so it had dropped down to only one
> nfsiod thread.
>
Hmm, I see the problem and I'm a bit surprised it doesn't bite more often. It seems to me that this snippet of code from nfs_asyncio() makes an assumption that doesn't always hold:

    /*
     * If none are free, we may already have an iod working on this mount
     * point. If so, it will process our request.
     */
    if (!gotiod) {
        if (nmp->nm_bufqiods > 0) {
            NFS_DPF(ASYNCIO,
                ("nfs_asyncio: %d iods are already processing mount %p\n",
                nmp->nm_bufqiods, nmp));
            gotiod = TRUE;
        }
    }

It assumes that, since an nfsiod thread is processing some buffer for the mount, it will become available to do this one, which isn't true for your deadlock.

I think the simple fix would be to recode nfs_asyncio() so that it only returns 0 if it finds an AVAILABLE nfsiod thread that it has assigned to do the I/O, getting rid of the above. The problem with doing this is that it may result in a lot more synchronous I/O (nfs_asyncio() returns EIO, so the caller does the I/O). Maybe more synchronous I/O could be avoided by allowing nfs_asyncio() to create a new thread even if the total is above nfs_iodmax. (I think this would require the fixed array to be replaced with a linked list and might result in a large number of nfsiod threads.) Maybe just having a large nfs_iodmax would be an adequate compromise?
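To make the difference concrete, here is a toy model of the two dispatch rules. It is Python, not kernel code, and it is only a sketch: the Iod and Mount classes and both function names are invented for illustration, and the real nfs_asyncio() does a good deal more (locking, wakeups, queue limits). The only things taken from the discussion above are the nm_bufqiods check and the EIO return that makes the caller do the I/O itself.

```python
import errno
from dataclasses import dataclass, field


@dataclass
class Iod:
    busy: bool = False          # True while the thread is servicing a buffer


@dataclass
class Mount:
    bufq: list = field(default_factory=list)   # buffers queued for async I/O
    bufqiods: int = 0                           # iods assigned to this mount


def _claim_idle(iods, mount):
    """Assign the first idle iod to this mount; return True on success."""
    for iod in iods:
        if not iod.busy:
            iod.busy = True
            mount.bufqiods += 1
            return True
    return False


def asyncio_current(iods, mount, buf):
    """Existing rule: also queue when some iod is already on this mount."""
    gotiod = _claim_idle(iods, mount)
    if not gotiod and mount.bufqiods > 0:
        gotiod = True           # hope the busy iod gets to this buffer later
    if gotiod:
        mount.bufq.append(buf)
        return 0
    return errno.EIO


def asyncio_proposed(iods, mount, buf):
    """Proposed rule: queue only if an idle iod was actually claimed."""
    if _claim_idle(iods, mount):
        mount.bufq.append(buf)
        return 0
    return errno.EIO            # caller falls back to synchronous I/O
```

With a single iod that is already busy on the mount, the first function queues the buffer behind a thread that may never get to it, while the second returns EIO and leaves the queue untouched.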
Does having a large # of nfsiod threads cause any serious problem for most systems these days? I'd be tempted to recode nfs_asyncio() as above and then, instead of nfs_iodmin and nfs_iodmax, I'd simply have: - a fixed number of nfsiod threads (this could be a tunable, with the understanding that it should be large for good performance) rick > The nfsiod thread is hung on a vnode lock: > > (kgdb) proc 36927 > [Switching to thread 150 (Thread 100679)]#0 sched_switch ( > td=0xffffff0320de88c0, newtd=0xffffff0003521460, flags=Variable > "flags" is > not available. > ) > at /usr/src/sys/kern/sched_ule.c:1898 > 1898 cpuid = PCPU_GET(cpuid); > (kgdb) where > #0 sched_switch (td=0xffffff0320de88c0, newtd=0xffffff0003521460, > flags=Variable "flags" is not available. > ) > at /usr/src/sys/kern/sched_ule.c:1898 > #1 0xffffffff80407953 in mi_switch (flags=260, newtd=0x0) > at /usr/src/sys/kern/kern_synch.c:449 > #2 0xffffffff8043e342 in sleepq_wait (wchan=0xffffff0358bbb7f8, > pri=96) > at /usr/src/sys/kern/subr_sleepqueue.c:629 > #3 0xffffffff803e5755 in __lockmgr_args (lk=0xffffff0358bbb7f8, > flags=524544, > ilk=0xffffff0358bbb820, wmesg=Variable "wmesg" is not available. > ) at /usr/src/sys/kern/kern_lock.c:220 > #4 0xffffffff80489219 in vop_stdlock (ap=Variable "ap" is not > available. > ) at lockmgr.h:94 > #5 0xffffffff80697322 in VOP_LOCK1_APV (vop=0xffffffff80892b00, > a=0xffffff847ac10600) at vnode_if.c:1988 > #6 0xffffffff804a8bb7 in _vn_lock (vp=0xffffff0358bbb760, > flags=524288, > file=0xffffffff806fa421 "/usr/src/sys/kern/vfs_subr.c", line=2138) > at vnode_if.h:859 > #7 0xffffffff8049b680 in vget (vp=0xffffff0358bbb760, flags=524544, > td=0xffffff0320de88c0) at /usr/src/sys/kern/vfs_subr.c:2138 > #8 0xffffffff8048d4aa in vfs_hash_get (mp=0xffffff004a3a0000, > hash=227722108, > flags=Variable "flags" is not available. 
> ) at /usr/src/sys/kern/vfs_hash.c:81 > #9 0xffffffff805631f6 in nfs_nget (mntp=0xffffff004a3a0000, > fhp=0xffffff03771eed56, fhsize=32, npp=0xffffff847ac10a40, > flags=524288) > at /usr/src/sys/nfsclient/nfs_node.c:120 > #10 0xffffffff80570229 in nfs_readdirplusrpc (vp=0xffffff0179902760, > uiop=0xffffff847ac10ad0, cred=0xffffff005587c300) > at /usr/src/sys/nfsclient/nfs_vnops.c:2636 > ---Type to continue, or q to quit--- > #11 0xffffffff8055f144 in nfs_doio (vp=0xffffff0179902760, > bp=0xffffff83e05c5860, cr=0xffffff005587c300, td=Variable "td" is not > available. > ) > at /usr/src/sys/nfsclient/nfs_bio.c:1600 > #12 0xffffffff8056770a in nfssvc_iod (instance=Variable "instance" is > not > available. > ) > at /usr/src/sys/nfsclient/nfs_nfsiod.c:303 > #13 0xffffffff803d0c2f in fork_exit (callout=0xffffffff805674b0 > , > arg=0xffffffff809266e0, frame=0xffffff847ac10c40) > at /usr/src/sys/kern/kern_fork.c:861 > > Thread stuck in getblk for that vnode (holds shared lock on this > vnode): > > (kgdb) proc 36902 > [Switching to thread 149 (Thread 101543)]#0 sched_switch ( > td=0xffffff0378d8a8c0, newtd=0xffffff0003521460, flags=Variable > "flags" is > not available. > ) > at /usr/src/sys/kern/sched_ule.c:1898 > 1898 cpuid = PCPU_GET(cpuid); > (kgdb) where > #0 sched_switch (td=0xffffff0378d8a8c0, newtd=0xffffff0003521460, > flags=Variable "flags" is not available. > ) > at /usr/src/sys/kern/sched_ule.c:1898 > #1 0xffffffff80407953 in mi_switch (flags=260, newtd=0x0) > at /usr/src/sys/kern/kern_synch.c:449 > #2 0xffffffff8043e342 in sleepq_wait (wchan=0xffffff83e11bc1c0, > pri=96) > at /usr/src/sys/kern/subr_sleepqueue.c:629 > #3 0xffffffff803e5755 in __lockmgr_args (lk=0xffffff83e11bc1c0, > flags=530688, > ilk=0xffffff0358bbb878, wmesg=Variable "wmesg" is not available. 
> ) at /usr/src/sys/kern/kern_lock.c:220 > #4 0xffffffff80483aeb in getblk (vp=0xffffff0358bbb760, blkno=3, > size=32768, > slpflag=0, slptimeo=0, flags=0) at lockmgr.h:94 > #5 0xffffffff8055e963 in nfs_getcacheblk (vp=0xffffff0358bbb760, bn=3, > size=32768, td=0xffffff0378d8a8c0) at > /usr/src/sys/nfsclient/nfs_bio.c:1259 > #6 0xffffffff805627d9 in nfs_bioread (vp=0xffffff0358bbb760, > uio=0xffffff847bcf0ad0, ioflag=Variable "ioflag" is not available. > ) at /usr/src/sys/nfsclient/nfs_bio.c:530 > #7 0xffffffff806956a4 in VOP_READ_APV (vop=0xffffffff808a29a0, > a=0xffffff847bcf09c0) at vnode_if.c:887 > #8 0xffffffff804a9e27 in vn_read (fp=0xffffff03b6a506e0, > uio=0xffffff847bcf0ad0, active_cred=Variable "active_cred" is not > available. > ) at vnode_if.h:384 > #9 0xffffffff80444fb1 in dofileread (td=0xffffff0378d8a8c0, fd=3, > fp=0xffffff03b6a506e0, auio=0xffffff847bcf0ad0, offset=Variable > "offset" > is not available. > ) at file.h:242 > > The buffer is locked by LK_KERNPROC: > > (kgdb) bprint bp > 0xffffff83e11bc128: BIO_READ flags (ASYNC|VMIO) > error = 0, bufsize = 32768, bcount = 32768, b_resid = 0 > bufobj = 0xffffff0358bbb878, data = 0xffffff8435a05000, blkno = d, dep > = 0x0 > lock type bufwait: EXCL by LK_KERNPROC with exclusive waiters pending > > And this buffer is queued as the first pending buffer on the mount > waiting > for service by nfsiod: > > (kgdb) set $nmp = (struct nfsmount *)vp->v_mount->mnt_data > (kgdb) p $nmp->nm_bufq.tqh_first > $24 = (struct buf *) 0xffffff83e11bc128 > (kgdb) p bp > $25 = (struct buf *) 0xffffff83e11bc128 > > So, the first process is waiting for a block from an NFS directory. > That > block is in queue to be completed as an async I/O by the nfsiod thread > pool. > However, the lone nfsiod thread in the pool is waiting to exclusively > lock the > original NFS directory to update its attributes, so it cannot service > the > async I/O request. 
> > -- > John Baldwin From owner-freebsd-fs@FreeBSD.ORG Thu Mar 14 01:20:30 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 4B5DB987; Thu, 14 Mar 2013 01:20:30 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id E1B22D42; Thu, 14 Mar 2013 01:20:29 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqAEAHEkQVGDaFvO/2dsb2JhbABDiCi8PIF0dIIqAQEFIwRSGw4KAgINGQJZBognrzGSVIEjjTk0B4ItgRMDlliRAoMmIIFs X-IronPort-AV: E=Sophos;i="4.84,840,1355115600"; d="scan'208";a="21133806" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 13 Mar 2013 21:20:28 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 829BAB4023; Wed, 13 Mar 2013 21:20:28 -0400 (EDT) Date: Wed, 13 Mar 2013 21:20:28 -0400 (EDT) From: Rick Macklem To: John Baldwin Message-ID: <1040319431.3883577.1363224028494.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <201303131356.37919.jhb@freebsd.org> Subject: Re: Deadlock in the NFS client MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.203] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692) Cc: Rick Macklem , fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Mar 2013 01:20:30 -0000 I wrote: > Does having a large # of nfsiod threads cause any serious problem for most > systems these days? 
>
> I'd be tempted to recode nfs_asyncio() as above and then, instead of nfs_iodmin
> and nfs_iodmax, I'd simply have:
> - a fixed number of nfsiod threads (this could be a tunable, with the
>   understanding that it should be large for good performance)

I'm probably getting ahead of myself here, since changing nfs_asyncio() may/may not fix the deadlock, but I thought I'd comment further on the above.

It may be possible to add a new nfs_iod_target (the desired # of nfsiod threads) and adjust that dynamically based on the ratio of the # of times nfs_asyncio() returns:
#EIO/#0
--> when there are too many EIO returns, increase nfs_iod_target
--> very few EIO returns, decrease nfs_iod_target
- Use nfs_iodmin, nfs_iodmax as the limits for nfs_iod_target and set nfs_iodmax much larger than it currently is, by default, with nfs_iod_target set to what nfs_iodmax is currently set to, by default.

rick

From owner-freebsd-fs@FreeBSD.ORG Thu Mar 14 03:45:07 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 99794467; Thu, 14 Mar 2013 03:45:07 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 0C359F9D; Thu, 14 Mar 2013 03:45:06 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqEEAOBGQVGDaFvO/2dsb2JhbAA7CIgxuV+CXYF7dIIqAQEEASMEUgUWDgoCAg0ZAlkGiCEGrxuSVYEjjDiBATQHgi2BEwOWWJECgyYggWw X-IronPort-AV: E=Sophos;i="4.84,842,1355115600"; d="scan'208";a="18907384" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-annu.net.uoguelph.ca with ESMTP; 13 Mar 2013 23:45:05 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id AEA36B4022; Wed, 13 Mar 2013 23:45:05 -0400 (EDT) Date: Wed, 13 Mar 2013 23:45:05 -0400 (EDT) From: Rick
Macklem To: John Baldwin Message-ID: <1881919310.3887914.1363232705678.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <201303131356.37919.jhb@freebsd.org> Subject: Re: Deadlock in the NFS client MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692) Cc: Rick Macklem , fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Mar 2013 03:45:07 -0000

John Baldwin wrote:
> I ran into a machine that had a deadlock among certain files on a given NFS
> mount today. I'm not sure how best to resolve it, though it seems like
> perhaps there is a bug with how the pool of nfsiod threads is managed.
> Anyway, more details on the actual hang below. This was on 8.x with the
> old NFS client, but I don't see anything in HEAD that would fix this.
>
> First note that the system was idle so it had dropped down to only one
> nfsiod thread.
>
Oh, and one more thing...
- I think this is specific to readdirplus, since I think it's the only RPC done by the nfsiod threads that tries to lock vnodes.

Readdirplus isn't the default for mounts, and I think the deadlock also requires a shortage of nfsiod threads and a directory and its subdirectory being read concurrently. That would explain why it hasn't been seen more often, I think.

rick

> The nfsiod thread is hung on a vnode lock:
>
> (kgdb) proc 36927
> [Switching to thread 150 (Thread 100679)]#0 sched_switch (
> td=0xffffff0320de88c0, newtd=0xffffff0003521460, flags=Variable
> "flags" is
> not available.
> )
> at /usr/src/sys/kern/sched_ule.c:1898
> 1898 cpuid = PCPU_GET(cpuid);
> (kgdb) where
> #0 sched_switch (td=0xffffff0320de88c0, newtd=0xffffff0003521460,
> flags=Variable "flags" is not available.
> ) > at /usr/src/sys/kern/sched_ule.c:1898 > #1 0xffffffff80407953 in mi_switch (flags=260, newtd=0x0) > at /usr/src/sys/kern/kern_synch.c:449 > #2 0xffffffff8043e342 in sleepq_wait (wchan=0xffffff0358bbb7f8, > pri=96) > at /usr/src/sys/kern/subr_sleepqueue.c:629 > #3 0xffffffff803e5755 in __lockmgr_args (lk=0xffffff0358bbb7f8, > flags=524544, > ilk=0xffffff0358bbb820, wmesg=Variable "wmesg" is not available. > ) at /usr/src/sys/kern/kern_lock.c:220 > #4 0xffffffff80489219 in vop_stdlock (ap=Variable "ap" is not > available. > ) at lockmgr.h:94 > #5 0xffffffff80697322 in VOP_LOCK1_APV (vop=0xffffffff80892b00, > a=0xffffff847ac10600) at vnode_if.c:1988 > #6 0xffffffff804a8bb7 in _vn_lock (vp=0xffffff0358bbb760, > flags=524288, > file=0xffffffff806fa421 "/usr/src/sys/kern/vfs_subr.c", line=2138) > at vnode_if.h:859 > #7 0xffffffff8049b680 in vget (vp=0xffffff0358bbb760, flags=524544, > td=0xffffff0320de88c0) at /usr/src/sys/kern/vfs_subr.c:2138 > #8 0xffffffff8048d4aa in vfs_hash_get (mp=0xffffff004a3a0000, > hash=227722108, > flags=Variable "flags" is not available. > ) at /usr/src/sys/kern/vfs_hash.c:81 > #9 0xffffffff805631f6 in nfs_nget (mntp=0xffffff004a3a0000, > fhp=0xffffff03771eed56, fhsize=32, npp=0xffffff847ac10a40, > flags=524288) > at /usr/src/sys/nfsclient/nfs_node.c:120 > #10 0xffffffff80570229 in nfs_readdirplusrpc (vp=0xffffff0179902760, > uiop=0xffffff847ac10ad0, cred=0xffffff005587c300) > at /usr/src/sys/nfsclient/nfs_vnops.c:2636 > ---Type to continue, or q to quit--- > #11 0xffffffff8055f144 in nfs_doio (vp=0xffffff0179902760, > bp=0xffffff83e05c5860, cr=0xffffff005587c300, td=Variable "td" is not > available. > ) > at /usr/src/sys/nfsclient/nfs_bio.c:1600 > #12 0xffffffff8056770a in nfssvc_iod (instance=Variable "instance" is > not > available. 
> ) > at /usr/src/sys/nfsclient/nfs_nfsiod.c:303 > #13 0xffffffff803d0c2f in fork_exit (callout=0xffffffff805674b0 > , > arg=0xffffffff809266e0, frame=0xffffff847ac10c40) > at /usr/src/sys/kern/kern_fork.c:861 > > Thread stuck in getblk for that vnode (holds shared lock on this > vnode): > > (kgdb) proc 36902 > [Switching to thread 149 (Thread 101543)]#0 sched_switch ( > td=0xffffff0378d8a8c0, newtd=0xffffff0003521460, flags=Variable > "flags" is > not available. > ) > at /usr/src/sys/kern/sched_ule.c:1898 > 1898 cpuid = PCPU_GET(cpuid); > (kgdb) where > #0 sched_switch (td=0xffffff0378d8a8c0, newtd=0xffffff0003521460, > flags=Variable "flags" is not available. > ) > at /usr/src/sys/kern/sched_ule.c:1898 > #1 0xffffffff80407953 in mi_switch (flags=260, newtd=0x0) > at /usr/src/sys/kern/kern_synch.c:449 > #2 0xffffffff8043e342 in sleepq_wait (wchan=0xffffff83e11bc1c0, > pri=96) > at /usr/src/sys/kern/subr_sleepqueue.c:629 > #3 0xffffffff803e5755 in __lockmgr_args (lk=0xffffff83e11bc1c0, > flags=530688, > ilk=0xffffff0358bbb878, wmesg=Variable "wmesg" is not available. > ) at /usr/src/sys/kern/kern_lock.c:220 > #4 0xffffffff80483aeb in getblk (vp=0xffffff0358bbb760, blkno=3, > size=32768, > slpflag=0, slptimeo=0, flags=0) at lockmgr.h:94 > #5 0xffffffff8055e963 in nfs_getcacheblk (vp=0xffffff0358bbb760, bn=3, > size=32768, td=0xffffff0378d8a8c0) at > /usr/src/sys/nfsclient/nfs_bio.c:1259 > #6 0xffffffff805627d9 in nfs_bioread (vp=0xffffff0358bbb760, > uio=0xffffff847bcf0ad0, ioflag=Variable "ioflag" is not available. > ) at /usr/src/sys/nfsclient/nfs_bio.c:530 > #7 0xffffffff806956a4 in VOP_READ_APV (vop=0xffffffff808a29a0, > a=0xffffff847bcf09c0) at vnode_if.c:887 > #8 0xffffffff804a9e27 in vn_read (fp=0xffffff03b6a506e0, > uio=0xffffff847bcf0ad0, active_cred=Variable "active_cred" is not > available. 
> ) at vnode_if.h:384 > #9 0xffffffff80444fb1 in dofileread (td=0xffffff0378d8a8c0, fd=3, > fp=0xffffff03b6a506e0, auio=0xffffff847bcf0ad0, offset=Variable > "offset" > is not available. > ) at file.h:242 > > The buffer is locked by LK_KERNPROC: > > (kgdb) bprint bp > 0xffffff83e11bc128: BIO_READ flags (ASYNC|VMIO) > error = 0, bufsize = 32768, bcount = 32768, b_resid = 0 > bufobj = 0xffffff0358bbb878, data = 0xffffff8435a05000, blkno = d, dep > = 0x0 > lock type bufwait: EXCL by LK_KERNPROC with exclusive waiters pending > > And this buffer is queued as the first pending buffer on the mount > waiting > for service by nfsiod: > > (kgdb) set $nmp = (struct nfsmount *)vp->v_mount->mnt_data > (kgdb) p $nmp->nm_bufq.tqh_first > $24 = (struct buf *) 0xffffff83e11bc128 > (kgdb) p bp > $25 = (struct buf *) 0xffffff83e11bc128 > > So, the first process is waiting for a block from an NFS directory. > That > block is in queue to be completed as an async I/O by the nfsiod thread > pool. > However, the lone nfsiod thread in the pool is waiting to exclusively > lock the > original NFS directory to update its attributes, so it cannot service > the > async I/O request. 
> > -- > John Baldwin From owner-freebsd-fs@FreeBSD.ORG Thu Mar 14 07:35:04 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 5B654CB2 for ; Thu, 14 Mar 2013 07:35:04 +0000 (UTC) (envelope-from phantom@phantom.su) Received: from relay13.nicmail.ru (relay13.nicmail.ru [195.208.6.7]) by mx1.freebsd.org (Postfix) with ESMTP id E570D93F for ; Thu, 14 Mar 2013 07:35:03 +0000 (UTC) Received: from [109.70.25.119] (port=54701 helo=nicmail.ru) by f17.mail.nic.ru with esmtp (Exim 5.55) (envelope-from ) id 1UG2b9-000Kmb-5c for freebsd-fs@freebsd.org; Thu, 14 Mar 2013 11:29:11 +0400 Received: from [194.85.198.26] (account phantom@phantom.su HELO phantom-mobile.node) by fcgp05.nicmail.ru (CommuniGate Pro SMTP 5.2.3) with ESMTPSA id 178408978 for freebsd-fs@freebsd.org; Thu, 14 Mar 2013 11:29:11 +0400 Message-ID: <51417C47.8010304@phantom.su> Date: Thu, 14 Mar 2013 11:29:11 +0400 From: Noskov Ilia User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130215 Thunderbird/17.0.3 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Re: should vn_fullpath1() ever return a path with "." in it? 
References: <1208475167.3432384.1362099531469.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <1208475167.3432384.1362099531469.JavaMail.root@erie.cs.uoguelph.ca>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
Reply-To: phantom@phantom.su
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Thu, 14 Mar 2013 07:35:04 -0000

On 03/01/2013 04:58 AM, Rick Macklem wrote:
> Kostik Belousov wrote:
>> On Wed, Feb 27, 2013 at 09:59:22PM -0500, Rick Macklem wrote:
>>> Hi,
>>>
>>> Sergey Kandaurov reported a problem where getcwd() returns a
>>> path with "/./" embedded in it for an NFSv4 mount. This is
>>> caused by a mount point crossing on the server when at the
>>> server's root because vn_fullpath1() uses VV_ROOT to spot
>>> mount point crossings.
>>>
>>> The current workaround is to use the sysctls:
>>> debug.disablecwd=1
>>> debug.disablefullpath=1
>>>
>>> However, it would be nice to fix this when vn_fullpath1()
>>> is being used.
>>>
>>> A simple fix is to have vn_fullpath1() fail when it finds
>>> "." as a directory match in the path. When vn_fullpath1()
>>> fails, the syscalls fail and that allows the libc algorithm
>>> to be used (which works for this case because it doesn't
>>> depend on VV_ROOT being set, etc).
>>>
>>> So, I am wondering if a patch (I have attached one) that
>>> makes vn_fullpath1() fail when it matches "." will break
>>> anything else? (I don't think so, since the code checks
>>> for VV_ROOT in the loop above the check for a match of
>>> ".", but I am not sure?)
>>>
>>> Thanks for any input w.r.t.
this, rick >> >>> --- kern/vfs_cache.c.sav 2013-02-27 20:44:42.000000000 -0500 >>> +++ kern/vfs_cache.c 2013-02-27 21:10:39.000000000 -0500 >>> @@ -1333,6 +1333,20 @@ vn_fullpath1(struct thread *td, struct v >>> startvp, NULL, 0, 0); >>> break; >>> } >>> + if (buf[buflen] == '.' && (buf[buflen + 1] == '\0' || >>> + buf[buflen + 1] == '/')) { >>> + /* >>> + * Fail if it matched ".". This should only happen >>> + * for NFSv4 mounts that cross server mount points. >>> + */ >>> + CACHE_RUNLOCK(); >>> + vrele(vp); >>> + numfullpathfail1++; >>> + error = ENOENT; >>> + SDT_PROBE(vfs, namecache, fullpath, return, >>> + error, vp, NULL, 0, 0); >>> + break; >>> + } >>> buf[--buflen] = '/'; >>> slash_prefixed = 1; >>> } >> >> I do not quite understand this. Did the dvp (parent) vnode returned by >> VOP_VPTOCNP() equal to vp (child) vnode in the case of the "." name ? >> It must be, for the correct operation, but also it should cause the >> almost >> infinite loop in the vn_fullpath1(). The loop is not really infinite >> due >> to a limited size of the buffer where the infinite amount of "./" is >> placed. >> >> Anyway, I think we should do better than this patch, even if it is >> legitimate. I think that the better place to check the condition is >> the >> default implementation of VOP_VPTOCNP(). Am I right that this is where >> it broke for you ? >> >> diff --git a/sys/kern/vfs_default.c b/sys/kern/vfs_default.c >> index 00d064e..1dd0185 100644 >> --- a/sys/kern/vfs_default.c >> +++ b/sys/kern/vfs_default.c >> @@ -856,8 +856,12 @@ vop_stdvptocnp(struct vop_vptocnp_args *ap) >> error = ENOMEM; >> goto out; >> } >> - bcopy(dp->d_name, buf + i, dp->d_namlen); >> - error = 0; >> + if (dp->d_namlen == 1 && dp->d_name[0] == '.') { >> + error = ENOENT; >> + } else { >> + bcopy(dp->d_name, buf + i, dp->d_namlen); >> + error = 0; >> + } >> goto out; >> } >> } while (len > 0 || !eofflag); > > Yes, this patch fixes the problem too. 
If you think it is safe to > do this, I can commit the patch in mid-April. Maybe Sergey can > test it? > > Thanks yet again, rick Hi, Rick. Strange behavior on nfs-client after apply this patch: sysctl debug.disablecwd=0 sysctl debug.disablefullpath=0 # mount -v -t nfs 192.168.168.1:/pool on /home (nfs, noatime, nfsv4acls, fsid 02ff003a3a000000) # ls /home | wc -l 4946 # cd /home/user6308/.ro # time pwd /home/user6308/.ro 0.008u 0.269s 0:08.47 3.0% 4+157k 0+0io 0pf+0w # ktrace -t+ -i pwd ktrace.out is big (1MB). Attach or not? A small piece of trace: 19527 pwd CALL mmap(0,0x400000,0x3,0x1002,0xffffffff,0) 19527 pwd RET mmap 34376515584/0x801000000 19527 pwd CALL __getcwd(0x801006400,0x400) 19527 pwd NAMI ".." 19527 pwd NAMI ".." 19527 pwd RET __getcwd -1 errno 2 No such file or directory 19527 pwd CALL stat(0x800947a14,0x7fffffffd940) 19527 pwd NAMI "/" 19527 pwd STRU struct stat {dev=98, ino=2, mode=drwxr-xr-x , nlink=19, uid=0, gid=0, rdev=2120, atime=1363244893, stime=1362653279, ctime=1362653279, birthtime=1200836451, size=1024, blksize=16384, blocks=4, flags=0x0 } 19527 pwd RET stat 0 19527 pwd CALL lstat(0x80094779c,0x7fffffffd940) 19527 pwd NAMI "." 19527 pwd STRU struct stat {dev=1230702064, ino=145, mode=drwxr-xr-x , nlink=2, uid=0, gid=0, rdev=4294967295, atime=1363244672.246785874, stime=1363244792.864201338, ctime=1363244792.864201338, birthtime=-1, size=3, blksize=4096, blocks=3, flags=0x0 } 19527 pwd RET lstat 0 19527 pwd CALL openat(0xffffff9c,0x80094779b,0x100000,0x2) 19527 pwd NAMI ".." 
19527 pwd RET openat 3 19527 pwd CALL fstat(0x3,0x7fffffffd880) 19527 pwd STRU struct stat {dev=1230702064, ino=4, mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, atime=1363244665.232140704, stime=1363010116.496298252, ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, blocks=3, flags=0x0 } 19527 pwd RET fstat 0 19527 pwd CALL fcntl(0x3,F_SETFD,FD_CLOEXEC) 19527 pwd RET fcntl 0 19527 pwd CALL fstatfs(0x3,0x7fffffffd660) 19527 pwd RET fstatfs 0 19527 pwd CALL fstat(0x3,0x7fffffffd940) 19527 pwd STRU struct stat {dev=1230702064, ino=4, mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, atime=1363244665.232140704, stime=1363010116.496298252, ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, blocks=3, flags=0x0 } 19527 pwd RET fstat 0 19527 pwd CALL getdirentries(0x3,0x801018000,0x1000,0x8010160a8) 19527 pwd RET getdirentries 4096/0x1000 19527 pwd CALL fstat(0x3,0x7fffffffd940) 19527 pwd STRU struct stat {dev=1230702064, ino=4, mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, atime=1363244665.232140704, stime=1363010116.496298252, ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, blocks=3, flags=0x0 } 19527 pwd RET fstat 0 19527 pwd CALL openat(0x3,0x80094779b,0x100000,0) 19527 pwd NAMI ".." 19527 pwd RET openat 4 [..............................] 
19527 pwd CALL madvise(0x801016000,0x1000,MADV_FREE) 19527 pwd RET madvise 0 19527 pwd CALL madvise(0x801018000,0x2000,MADV_FREE) 19527 pwd RET madvise 0 19527 pwd CALL close(0x3) 19527 pwd RET close 0 19527 pwd CALL fstat(0x4,0x7fffffffd880) 19527 pwd STRU struct stat {dev=973143810, ino=4, mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295, atime=1363244767.460164771, stime=1363172100.380266923, ctime=1363172100.380266923, birthtime=-1, size=4948, blksize=4096, blocks=713, flags=0x0 } 19527 pwd RET fstat 0 19527 pwd CALL fcntl(0x4,F_SETFD,FD_CLOEXEC) 19527 pwd RET fcntl 0 19527 pwd CALL fstatfs(0x4,0x7fffffffd660) 19527 pwd RET fstatfs 0 19527 pwd CALL fstat(0x4,0x7fffffffd940) 19527 pwd STRU struct stat {dev=973143810, ino=4, mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295, atime=1363244767.460164771, stime=1363172100.380266923, ctime=1363172100.380266923, birthtime=-1, size=4948, blksize=4096, blocks=713, flags=0x0 } 19527 pwd RET fstat 0 19527 pwd CALL getdirentries(0x4,0x801018000,0x1000,0x8010160a8) 19527 pwd RET getdirentries 4096/0x1000 19527 pwd CALL fstatat(0x4,0x801018030,0x7fffffffd940,0x200) 19527 pwd NAMI "user6158" 19527 pwd STRU struct stat {dev=1774902232, ino=4, mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, atime=1363009687.040357529, stime=1363010116.496298252, ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, blocks=3, flags=0x0 } 19527 pwd RET fstatat 0 19527 pwd CALL fstatat(0x4,0x80101804c,0x7fffffffd940,0x200) 19527 pwd NAMI "user2289" 19527 pwd STRU struct stat {dev=1988229825, ino=4, mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, atime=1363009687.040357529, stime=1363010116.496298252, ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, blocks=3, flags=0x0 } 19527 pwd RET fstatat 0 19527 pwd CALL fstatat(0x4,0x801018068,0x7fffffffd940,0x200) 19527 pwd NAMI "user4761" 19527 pwd STRU struct stat {dev=2438657130, ino=4, mode=drwxr-xr-x , nlink=9, uid=0, gid=0, 
rdev=4294967295, atime=1363009687.040357529, stime=1363010116.496298252, ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, blocks=3, flags=0x0 } 19527 pwd RET fstatat 0 19527 pwd CALL fstatat(0x4,0x801018084,0x7fffffffd940,0x200) 19527 pwd NAMI "user6055" [.........................................] and next get stat of all directories in /home > > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" > -- Best Regards, Ilia Noskov Regional Network Information Center (RU-CENTER) phone: +7 495 737-0601 fax: +7 495 737-0602 http://www.nic.ru From owner-freebsd-fs@FreeBSD.ORG Thu Mar 14 09:09:01 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 0C715441 for ; Thu, 14 Mar 2013 09:09:01 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 77016D70 for ; Thu, 14 Mar 2013 09:09:00 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.6/8.14.6) with ESMTP id r2E98oUH035939; Thu, 14 Mar 2013 11:08:50 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.8.0 kib.kiev.ua r2E98oUH035939 Received: (from kostik@localhost) by tom.home (8.14.6/8.14.6/Submit) id r2E98l4X035938; Thu, 14 Mar 2013 11:08:47 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 14 Mar 2013 11:08:47 +0200 From: Konstantin Belousov To: Noskov Ilia Subject: Re: should vn_fullpath1() ever return a path with "." in it? 
Message-ID: <20130314090847.GH3794@kib.kiev.ua>
References: <1208475167.3432384.1362099531469.JavaMail.root@erie.cs.uoguelph.ca>
 <51417C47.8010304@phantom.su>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature"; boundary="0y/VvS0T6GrSShsQ"
Content-Disposition: inline
In-Reply-To: <51417C47.8010304@phantom.su>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00,
 DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no
 version=3.3.2
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Thu, 14 Mar 2013 09:09:01 -0000

--0y/VvS0T6GrSShsQ
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Mar 14, 2013 at 11:29:11AM +0400, Noskov Ilia wrote:
> Strange behavior on nfs-client after apply this patch:
>
> sysctl debug.disablecwd=0
> sysctl debug.disablefullpath=0
>
> # mount -v -t nfs
> 192.168.168.1:/pool on /home (nfs, noatime, nfsv4acls, fsid
> 02ff003a3a000000)
> # ls /home | wc -l
> 4946
> # cd /home/user6308/.ro
> # time pwd
> /home/user6308/.ro
> 0.008u 0.269s 0:08.47 3.0% 4+157k 0+0io 0pf+0w
> # ktrace -t+ -i pwd
>
>
> ktrace.out is big (1MB). Attach or not?
>
>
>
> A small piece of trace:
> 19527 pwd CALL
> mmap(0,0x400000,0x3,0x1002,0xffffffff,0)
> 19527 pwd RET mmap 34376515584/0x801000000
> 19527 pwd CALL __getcwd(0x801006400,0x400)
> 19527 pwd NAMI ".."
> 19527 pwd NAMI ".."
> 19527 pwd RET __getcwd -1 errno 2 No such file or directory
> 19527 pwd CALL stat(0x800947a14,0x7fffffffd940)
> 19527 pwd NAMI "/"
> 19527 pwd STRU struct stat {dev=98, ino=2, mode=drwxr-xr-x ,
> nlink=19, uid=0, gid=0, rdev=2120, atime=1363244893, stime=1362653279,
> ctime=1362653279, birthtime=1200836451, size=1024, blksize=16384,
> blocks=4, flags=0x0 }
> 19527 pwd RET stat 0
> 19527 pwd CALL lstat(0x80094779c,0x7fffffffd940)
> 19527 pwd NAMI "."
> 19527 pwd STRU struct stat {dev=1230702064, ino=145,
> mode=drwxr-xr-x , nlink=2, uid=0, gid=0, rdev=4294967295,
> atime=1363244672.246785874, stime=1363244792.864201338,
> ctime=1363244792.864201338, birthtime=-1, size=3, blksize=4096,
> blocks=3, flags=0x0 }
> 19527 pwd RET lstat 0
> 19527 pwd CALL openat(0xffffff9c,0x80094779b,0x100000,0x2)
> 19527 pwd NAMI ".."
> 19527 pwd RET openat 3
> 19527 pwd CALL fstat(0x3,0x7fffffffd880)
> 19527 pwd STRU struct stat {dev=1230702064, ino=4,
> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> atime=1363244665.232140704, stime=1363010116.496298252,
> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> blocks=3, flags=0x0 }
> 19527 pwd RET fstat 0
> 19527 pwd CALL fcntl(0x3,F_SETFD,FD_CLOEXEC)
> 19527 pwd RET fcntl 0
> 19527 pwd CALL fstatfs(0x3,0x7fffffffd660)
> 19527 pwd RET fstatfs 0
> 19527 pwd CALL fstat(0x3,0x7fffffffd940)
> 19527 pwd STRU struct stat {dev=1230702064, ino=4,
> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> atime=1363244665.232140704, stime=1363010116.496298252,
> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> blocks=3, flags=0x0 }
> 19527 pwd RET fstat 0
> 19527 pwd CALL getdirentries(0x3,0x801018000,0x1000,0x8010160a8)
> 19527 pwd RET getdirentries 4096/0x1000
> 19527 pwd CALL fstat(0x3,0x7fffffffd940)
> 19527 pwd STRU struct stat {dev=1230702064, ino=4,
> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> atime=1363244665.232140704, stime=1363010116.496298252,
> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> blocks=3, flags=0x0 }
> 19527 pwd RET fstat 0
> 19527 pwd CALL openat(0x3,0x80094779b,0x100000,0)
> 19527 pwd NAMI ".."
> 19527 pwd RET openat 4
> [..............................]
> 19527 pwd CALL madvise(0x801016000,0x1000,MADV_FREE)
> 19527 pwd RET madvise 0
> 19527 pwd CALL madvise(0x801018000,0x2000,MADV_FREE)
> 19527 pwd RET madvise 0
> 19527 pwd CALL close(0x3)
> 19527 pwd RET close 0
> 19527 pwd CALL fstat(0x4,0x7fffffffd880)
> 19527 pwd STRU struct stat {dev=973143810, ino=4,
> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295,
> atime=1363244767.460164771, stime=1363172100.380266923,
> ctime=1363172100.380266923, birthtime=-1, size=4948, blksize=4096,
> blocks=713, flags=0x0 }
> 19527 pwd RET fstat 0
> 19527 pwd CALL fcntl(0x4,F_SETFD,FD_CLOEXEC)
> 19527 pwd RET fcntl 0
> 19527 pwd CALL fstatfs(0x4,0x7fffffffd660)
> 19527 pwd RET fstatfs 0
> 19527 pwd CALL fstat(0x4,0x7fffffffd940)
> 19527 pwd STRU struct stat {dev=973143810, ino=4,
> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295,
> atime=1363244767.460164771, stime=1363172100.380266923,
> ctime=1363172100.380266923, birthtime=-1, size=4948, blksize=4096,
> blocks=713, flags=0x0 }
> 19527 pwd RET fstat 0
> 19527 pwd CALL getdirentries(0x4,0x801018000,0x1000,0x8010160a8)
> 19527 pwd RET getdirentries 4096/0x1000
> 19527 pwd CALL fstatat(0x4,0x801018030,0x7fffffffd940,0x200)
> 19527 pwd NAMI "user6158"
> 19527 pwd STRU struct stat {dev=1774902232, ino=4,
> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> atime=1363009687.040357529, stime=1363010116.496298252,
> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> blocks=3, flags=0x0 }
> 19527 pwd RET fstatat 0
> 19527 pwd CALL fstatat(0x4,0x80101804c,0x7fffffffd940,0x200)
> 19527 pwd NAMI "user2289"
> 19527 pwd STRU struct stat {dev=1988229825, ino=4,
> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> atime=1363009687.040357529, stime=1363010116.496298252,
> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> blocks=3, flags=0x0 }
> 19527 pwd RET fstatat 0
> 19527 pwd CALL fstatat(0x4,0x801018068,0x7fffffffd940,0x200)
> 19527 pwd NAMI "user4761"
> 19527 pwd STRU struct stat {dev=2438657130, ino=4,
> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295,
> atime=1363009687.040357529, stime=1363010116.496298252,
> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096,
> blocks=3, flags=0x0 }
> 19527 pwd RET fstatat 0
> 19527 pwd CALL fstatat(0x4,0x801018084,0x7fffffffd940,0x200)
> 19527 pwd NAMI "user6055"
> [.........................................]
>
> and next get stat of all directories in /home

A slightly different version of the patch was committed as r247560.
The situation can only happen if the parent directory contains a "."
entry whose inode number equals the inode number of the subdirectory.
Can you confirm that this is the case for you?
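
The condition discussed in this thread, a directory entry whose name is exactly ".", can be illustrated in userland C. This is only a sketch modeled on the d_namlen/d_name test in the vop_stdvptocnp() diff quoted earlier; struct toy_dirent and copy_name_unless_dot() are hypothetical names, and this is not the code that was committed as r247560.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the two dirent fields the quoted diff tests. */
struct toy_dirent {
    unsigned char d_namlen;    /* length of d_name, not counting any NUL */
    char          d_name[256];
};

/*
 * Copy the entry name into buf unless the name is exactly ".".
 * Returns 0 on success, ENOENT for ".", ENOMEM if buf is too small.
 * The copied name is not NUL-terminated, as with bcopy() in the diff.
 */
int copy_name_unless_dot(const struct toy_dirent *dp, char *buf, size_t buflen)
{
    assert(dp != NULL && buf != NULL);
    if (dp->d_namlen == 1 && dp->d_name[0] == '.')
        return ENOENT;          /* refuse "." so the caller can fall back */
    if (dp->d_namlen > buflen)
        return ENOMEM;
    memcpy(buf, dp->d_name, dp->d_namlen);
    return 0;
}
```

In this sketch only the one-character name "." is rejected; ".." and ordinary names are copied through unchanged.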
--0y/VvS0T6GrSShsQ Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iQIcBAEBAgAGBQJRQZOfAAoJEJDCuSvBvK1Bu9IP/iRFexlxYkdp5x5FeZuzAt7A 3Bu4LKb8V3eaMEniTOIRQff8eO8GsA3Ti3s9F3ZHN6BiCriy5DrWeuxNJtAa6YmA GnU7r16aROpzJ8Z0CtTQoabKPNjCdqy9LG6jeiFlhemaG2tgpz2N6BX/MuJbrW0F CJIsw+QozZ/koqgXpFwvkb+kjXydGm41YZJxEkdtrIecampWzWunJD19cjD3sSrC rEYggRinQK/EQUrRBIidxWu9qKrW+WrZy4ePP7jjuIjr9//vYXIQnJre+BiYdqjL Ihwi5fJ91wOU1vNr/7VEN/MOyISqZjXkINdvtKOWWClX/ahC06JcWoPECNVM8S3/ F3eUNyECyxkSKNnHTRseAqVZBOUpE7ulr6fSxxGiVB1SCYZwmfXq7IODV0mazDI9 w/03KBCSca0HScX5j2cYrKooS9tthdZiAp2f4LRfHef1fnaesInfDmDHkff4pxs4 QsjdlYLHe/ke0XFzZsR8zcdtdX6HmdzcCJLjMERvEyZ+8KPqb75/ANZb4aLK+UE1 FUYec0QDJ/DsPqbucMbOr4uXiZHHGrwYP4yETATlh9QTxLBZSSNgK50qCEvTf1ek metx0YXNscOtTg+ZFY5qJtu4g8J/dMkWlUExzQ7FYIN9vfaGH5TRlfq+Qd0iGuRL nluNA4JJ4YVEg27W1ZbP =+vHn -----END PGP SIGNATURE----- --0y/VvS0T6GrSShsQ-- From owner-freebsd-fs@FreeBSD.ORG Thu Mar 14 09:27:37 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id C61645E9; Thu, 14 Mar 2013 09:27:37 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 213F4E0B; Thu, 14 Mar 2013 09:27:36 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.6/8.14.6) with ESMTP id r2E9RSba039426; Thu, 14 Mar 2013 11:27:28 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.8.0 kib.kiev.ua r2E9RSba039426 Received: (from kostik@localhost) by tom.home (8.14.6/8.14.6/Submit) id r2E9RS7p039425; Thu, 14 Mar 2013 11:27:28 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 14 Mar 2013 11:27:28 +0200 From: Konstantin Belousov To: Rick Macklem Subject: Re: Deadlock in the NFS client Message-ID: 
<20130314092728.GI3794@kib.kiev.ua> References: <201303131356.37919.jhb@freebsd.org> <492562517.3880600.1363217615412.JavaMail.root@erie.cs.uoguelph.ca> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="A+YNfBJfL1GjoVjN" Content-Disposition: inline In-Reply-To: <492562517.3880600.1363217615412.JavaMail.root@erie.cs.uoguelph.ca> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: Rick Macklem , fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Mar 2013 09:27:37 -0000 --A+YNfBJfL1GjoVjN Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, Mar 13, 2013 at 07:33:35PM -0400, Rick Macklem wrote: > John Baldwin wrote: > > I ran into a machine that had a deadlock among certain files on a > > given NFS > > mount today. I'm not sure how best to resolve it, though it seems like > > perhaps there is a bug with how the pool of nfsiod threads is managed. > > Anyway, more details on the actual hang below. This was on 8.x with > > the > > old NFS client, but I don't see anything in HEAD that would fix this. > >=20 > > First note that the system was idle so it had dropped down to only one > > nfsiod thread. > >=20 > Hmm, I see the problem and I'm a bit surprised it doesn't bite more often. > It seems to me that this snippet of code from nfs_asyncio() makes too > weak an assumption: > /* > * If none are free, we may already have an iod working on this mount > * point. If so, it will process our request. 
>  */
> if (!gotiod) {
>     if (nmp->nm_bufqiods > 0) {
>         NFS_DPF(ASYNCIO,
>             ("nfs_asyncio: %d iods are already processing mount %p\n",
>             nmp->nm_bufqiods, nmp));
>         gotiod = TRUE;
>     }
> }
> It assumes that, since an nfsiod thread is processing some buffer for the
> mount, it will become available to do this one, which isn't true for your
> deadlock.
>
> I think the simple fix would be to recode nfs_asyncio() so that
> it only returns 0 if it finds an AVAILABLE nfsiod thread that it
> has assigned to do the I/O, getting rid of the above. The problem
> with doing this is that it may result in a lot more synchronous I/O
> (nfs_asyncio() returns EIO, so the caller does the I/O). Maybe more
> synchronous I/O could be avoided by allowing nfs_asyncio() to create a
> new thread even if the total is above nfs_iodmax. (I think this would
> require the fixed array to be replaced with a linked list and might
> result in a large number of nfsiod threads.) Maybe just having a large
> nfs_iodmax would be an adequate compromise?
>
> Does having a large # of nfsiod threads cause any serious problem for
> most systems these days?
>
> I'd be tempted to recode nfs_asyncio() as above and then, instead
> of nfs_iodmin and nfs_iodmax, I'd simply have:
> - a fixed number of nfsiod threads (this could be a tunable, with the
>   understanding that it should be large for good performance)
>

I do not see how this would solve the deadlock itself. The proposal would
only allow the system to survive slightly longer after the deadlock
appeared. And, I think that allowing an unbounded number of nfsiod
threads is also fatal.

The issue there is the LOR (lock order reversal) between the buffer lock
and the vnode lock. The buffer lock must always come after the vnode
lock. The problematic nfsiod thread, which locks the vnode, violates this
rule, because despite the LK_KERNPROC ownership of the buffer lock, it is
the thread which de facto owns the buffer (only that thread can unlock
it).
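
The ordering rule above (vnode lock first, buffer lock second) means that a thread which already owns a buffer cannot safely sleep waiting for a vnode lock. One general way to stay within such a rule is to make the out-of-order acquisition non-blocking and give up when it fails. The following is a toy userland sketch of that idea, using a C11 atomic_flag as a try-lock. All names are hypothetical, and it does not model lockmgr, LK_KERNPROC, or the NFS client code.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock: the flag is set while the lock is held. */
typedef struct {
    atomic_flag held;
} toy_lock;

/* Try to take the lock without waiting; true if we got it. */
bool toy_trylock(toy_lock *l)
{
    return !atomic_flag_test_and_set(&l->held);
}

void toy_unlock(toy_lock *l)
{
    atomic_flag_clear(&l->held);
}

/*
 * Model of a worker that already owns a buffer and would like to update
 * attributes under the vnode lock. Taking the vnode lock here is out of
 * order, so the worker must not wait for it: on failure it skips the
 * optional work and reports EBUSY instead of sleeping.
 */
int update_attrs_nowait(toy_lock *vnode_lock, int *attr_updates)
{
    assert(vnode_lock != NULL && attr_updates != NULL);
    if (!toy_trylock(vnode_lock))
        return EBUSY;           /* someone else holds it: back off */
    (*attr_updates)++;          /* stand-in for the attribute update */
    toy_unlock(vnode_lock);
    return 0;
}
```

The design point is that the skipped work must be optional: the sketch only makes sense if the caller can carry on without the attribute update when EBUSY comes back.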

A possible solution would be to pass LK_NOWAIT to nfs_nget() from the
nfs_readdirplusrpc(). From my reading of the code, nfs_nget() should
be capable of correctly handling the lock failure. And EBUSY would
result in doit = 0, which should be fine too.

It is possible that EBUSY should be reset to 0, though.

--A+YNfBJfL1GjoVjN
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iQIcBAEBAgAGBQJRQZf/AAoJEJDCuSvBvK1BK+QP/019GONiMGOEZgy9jnRFk2aR
8hUfCDdJLiNy4e3Wa2gw89Yr9TGSiOaa6YRLKwSWQ9I39yFPOOeoM6kC/QD0oQqu
qlxdalKYbiJOR61ufnqIQCRsDufbKPD2IfkoTzEYiPCsZLEAu+yV0c/0g09mCMb5
+KIr9ku72CVkLba1BHBA+9CiAb1VFa234iQDc+t792e62ttPJPP7xhTylNaME3Y7
QWqFZjcG6PFfeQDOVkhWUGRO4m6Ak5peEpLXE1po0+sgfcnrZmgw4crgLzmIKKZl
vdQ3UetqWflaTCnP3L9B28j0+H/CS53VS9sndST8xYXPADMlnuoLLGkiBpsrfRMj
vQZHz7sV6+qNXxN2LZJBgHQPuio4zghyxP6+4j57BmCJfWm6gR2pqcHip9xDBI6j
hXbkL1nVPWRH6iIIeRVs9RWXrwaa+upNIX9+aSgKWRXitIH3gRL7Gjg57wI6wxaI
CznTVt8geuVz4C1jtNcR0BfZ5i0zjwBH66hLcX86HvfYGY26HmQPB3kOTfmaYpdp
KIhYxof5gEbyr3Zgl0MwEhAxLHbVyLBybkz6ENMtbBDOhZJ2pGFawIx7j9ngK+9D
g282I/j4S8mGPdSI4uNFPLPuGyYMcjUFTBZwDostsw1pzCy+QCN3gZQWuWdVfIYe
ZrYLzGYi4hGE22kBdBNV
=YA8y
-----END PGP SIGNATURE-----

--A+YNfBJfL1GjoVjN--

From owner-freebsd-fs@FreeBSD.ORG Thu Mar 14 09:41:34 2013
Return-Path:
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 2901B8FF
 for ; Thu, 14 Mar 2013 09:41:34 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail13.syd.optusnet.com.au (mail13.syd.optusnet.com.au [211.29.132.194])
 by mx1.freebsd.org (Postfix) with ESMTP id 8FF9CE93
 for ; Thu, 14 Mar 2013 09:41:33 +0000 (UTC)
Received: from c211-30-173-106.carlnfd1.nsw.optusnet.com.au (c211-30-173-106.carlnfd1.nsw.optusnet.com.au [211.30.173.106])
 by mail13.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id r2E9fIVt003177
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
 Thu, 14 Mar 2013
20:41:21 +1100 Date: Thu, 14 Mar 2013 20:41:18 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Cody Ritts Subject: Re: Aligning MBR for ZFS boot help In-Reply-To: <5140F373.1010907@caltel.com> Message-ID: <20130314195715.Y909@besplex.bde.org> References: <513C1629.50501@caltel.com> <513CD9AB.5080903@caltel.com> <513CE369.4030303@caltel.com> <1362951595.99445.2.camel@btw.pki2.com> <513E1208.5020804@caltel.com> <20130312203745.A1130@besplex.bde.org> <513F8F04.60206@caltel.com> <20130313232247.B1078@besplex.bde.org> <5140F373.1010907@caltel.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.0 cv=JMpjKL2b c=1 sm=1 a=u3bVZBOdoLwA:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=cUKNXEIY390A:10 a=s9DMC9WtY5DSvmTa96MA:9 a=CjuIK1q_8ugA:10 a=N-DlQLPxIT-iTUNV:21 a=8ReUPQTKasrRMr7F:21 a=TEtd8y5WR3g2ypngnwZWYw==:117 Cc: freebsd-fs@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Mar 2013 09:41:34 -0000 On Wed, 13 Mar 2013, Cody Ritts wrote: > So, if I now want to create an aligned single partition, here are the steps I > think I should be taking: > > Sectors should be < 64 > Heads should be < 256 > for OLD OLD stuff, cylinders should be < 1024 No. Sectors _must_ be < 64 Heads _must_ be < 257 Heads should be < 256 for OLD OLD stuff, cylinders should be < 1024 > if you want boundaries on a power of 2, those the number of sectors and heads > should also be a power of 2. > > So, would all of these be potential valid values? > > s32 h128 > 512*32*128 = 2097152B = 2MB cylinder > s32 h64 > 512*32*64 = 1048576B = 1MB cylinder > s16 h128 > 512*16*128 = 1048576B = 1MB cylinder > s4 h8 h4 > 512*4*4 = 8192B = 8K cylinder Yes, provided the BIOS agrees. 
If the BIOS config says that CHS is something or other, it is better to use that or change it. Also, s4 h8 would probably cause problems with older BIOSes if they actually use CHS, by causing cylinder numbers to exceed 64K. 64K cylinders of size 8KB is just 512MB. BIOSes might use 16-bit cylinder numbers for translating from CHS to linear even if they don't really use CHS. > I am assuming that once I know my cylinder size, I just divide the total size > of my hard drive to come up with cylinder count? Usually I don't bother changing the cylinder count when I change the number of heads and sectors, since I know that it is not really used by fdisk (except possibly for default partition sizes which I never use). > s4 h8 > 64023257088 / 8192 = 7815339c > (8k is the largest power of 2 that the drive will evenly divide into) > > s32 h64 > 64023257088 / 1048576 = 61057.3359375 > Round down to 61057. > (does the cylinder need to end on the end of the disk?) If the cylinder count were used, then it should be set to the rounded down value. It is usually safe to make the last partition end at the end of the disk and not at the end of the fake cylinder given by rounding. I sometimes use the part beyond the end of the fake cylinder for a normal partition and sometimes leave it free. 
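The cylinder arithmetic above can be sketched in a few lines of Python. This is only an illustration: the helper names are made up for the example, the 512-byte sector size and the 64023257088-byte disk size are the figures quoted in this thread, and nothing here talks to fdisk or to the BIOS.

```python
SECTOR_BYTES = 512  # media sector size quoted in this thread (assumption for other disks)


def cylinder_bytes(sectors_per_track, heads, sector_bytes=SECTOR_BYTES):
    """Size in bytes of one (fake) cylinder for a given S/H geometry."""
    return sector_bytes * sectors_per_track * heads


def whole_cylinders(disk_bytes, sectors_per_track, heads):
    """Return (cylinder count rounded down, bytes left past the last whole cylinder)."""
    return divmod(disk_bytes, cylinder_bytes(sectors_per_track, heads))


DISK_BYTES = 64023257088  # disk size used in the quoted calculation

# s32 h64 gives 1 MB cylinders; the count is rounded down, not up.
cyls, leftover = whole_cylinders(DISK_BYTES, 32, 64)
print(cyls, leftover, leftover // SECTOR_BYTES)  # 61057 352256 688
```

For s32 h64 this gives 61057 whole cylinders, with 352256 bytes (688 sectors) left over past the end of the last whole fake cylinder. Whether that tail is used or left free is the judgment call described above.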
> So, here is what i calculated: > c61057 h64 s32 > > I want an offset of 2M, file system should be reduced to 61055M > (61055 * 1024 * 1024)/512 = 125040640s) > > Here are the commands that I ran: > >> cat << EOF > command >> g c61057 h64 s32 >> p 1 0xa5 4096 125040640 >> a 1 >> EOF >> root@:/root # fdisk -f command ada0 >> ******* Working on device /dev/ada0 ******* >> fdisk: WARNING line 1: number of cylinders (61057) may be out-of-range >> (must be within 1-1024 for normal BIOS operation, unless the entire >> disk >> is dedicated to FreeBSD) >> root@:/root # fdisk -p ada0 >> # /dev/ada0 >> g c124053 h16 s63 >> p 1 0xa5 4096 125040640 >> a 1 > note, it auto goes back when exporting Seems reasonable, but I didn't check the details. Anything that avoids the message about the automatic broken adjustment is probably OK. >> root@:/root # gpart show ada0 >> => 63 125045361 ada0 MBR (59G) >> 63 4033 - free - (2M) >> 4096 125040640 1 freebsd [active] (59G) >> 125044736 688 - free - (344k) >> root@:/root # gpart delete -i 1 ada0 >> root@:/root # gpart add -t freebsd -b 4096 -s 125040640 ada0 >> ada0s1 added >> root@:/root # gpart show ada0 >> => 63 125045361 ada0 MBR (59G) >> 63 4095 - free - (2M) >> 4158 125040573 1 freebsd (59G) >> 125044731 693 - free - (346k) > gpart does not care I don't know anything about gpart, and if it always does the wrong adjustment then I don't want to know. 
>> root@:/root # fdisk -f command ada0 >> ******* Working on device /dev/ada0 ******* >> fdisk: WARNING line 1: number of cylinders (61057) may be out-of-range >> (must be within 1-1024 for normal BIOS operation, unless the entire >> disk >> is dedicated to FreeBSD) >> root@:/root # fdisk ada0 >> ******* Working on device /dev/ada0 ******* >> parameters extracted from in-core disklabel are: >> cylinders=124053 heads=16 sectors/track=63 (1008 blks/cyl) >> >> Figures below won't work with BIOS for partitions not in cyl 1 >> parameters to be used for BIOS calculations are: >> cylinders=124053 heads=16 sectors/track=63 (1008 blks/cyl) It regressed to the broken default as usual. This is dangerous when modifying partition tables that have a geometry differing from the default. You can still edit everything without changing the geometry if you are careful. >> Media sector size is 512 >> Warning: BIOS sector numbering starts with sector 1 >> Information from DOS bootblock is: >> The data for partition 1 is: >> sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD) >> start 4096, size 125040640 (61055 Meg), flag 80 (active) >> beg: cyl 2/ head 0/ sector 1; >> end: cyl 640/ head 63/ sector 32 E.g., suppose you just want to change the sysid here. Type it in. Accept the defaults for the start and size so that these don't change. Then fdisk will default to making a mess of the CHS values (if the default is wrong). This can be recovered from by typing in all the old values. >> The data for partition 2 is: >> >> The data for partition 3 is: >> >> The data for partition 4 is: >> > > So, setting the geom simply does this: >>> sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD) >>> start 4096, size 125040640 (61055 Meg), flag 80 (active) >>> beg: cyl 2/ head 0/ sector 1; >>> end: cyl 640/ head 63/ sector 32 > > I cannot set geom in my bios, nor does it show me what it thinks geom is. > Obviously anything that only supports 1024 cylinders will not think it is > very funny. 
Probably many newer BIOSes do this. > I feel like I am missing some part of this puzzle, or is that all there is to > this to correct geom for proper alignment on an MBR? I don't like the looks of gpart, but it has a -a option for alignment. > So, by setting those CHS values I am: > making the partition table more compatible with other operating systems and > BIOSes? > and giving some utilities the CHS stuff they need to function right? It's not completely clear that S=32 H=64 is portable, but it is what most old SCSI BIOSes used. Also, if the disk already has some partitions with a certain geometry, use the same geometry for other partitions and don't use fdisk's defaults if they differ. Bruce From owner-freebsd-fs@FreeBSD.ORG Thu Mar 14 10:20:49 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 7361C2D1 for ; Thu, 14 Mar 2013 10:20:49 +0000 (UTC) (envelope-from phantom@phantom.su) Received: from relay08.nicmail.ru (relay08.nicmail.ru [195.208.6.4]) by mx1.freebsd.org (Postfix) with ESMTP id C73EDC3 for ; Thu, 14 Mar 2013 10:20:48 +0000 (UTC) Received: from [109.70.25.145] (port=43055 helo=nicmail.ru) by f06.mail.nic.ru with esmtp (Exim 5.55) (envelope-from ) id 1UG57D-000MaL-01; Thu, 14 Mar 2013 14:10:27 +0400 Received: from [194.85.198.26] (account phantom@phantom.su HELO phantom-mobile.node) by fcgp09.nicmail.ru (CommuniGate Pro SMTP 5.2.3) with ESMTPSA id 99207934; Thu, 14 Mar 2013 14:10:26 +0400 Message-ID: <5141A212.9050909@phantom.su> Date: Thu, 14 Mar 2013 14:10:26 +0400 From: Ilia Noskov User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130215 Thunderbird/17.0.3 MIME-Version: 1.0 To: Konstantin Belousov Subject: Re: should vn_fullpath1() ever return a path with "." in it? 
References: <1208475167.3432384.1362099531469.JavaMail.root@erie.cs.uoguelph.ca> <51417C47.8010304@phantom.su> <20130314090847.GH3794@kib.kiev.ua> In-Reply-To: <20130314090847.GH3794@kib.kiev.ua> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: phantom@phantom.su List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Mar 2013 10:20:49 -0000 On 03/14/2013 01:08 PM, Konstantin Belousov wrote: > On Thu, Mar 14, 2013 at 11:29:11AM +0400, Noskov Ilia wrote: >> Strange behavior on nfs-client after apply this patch: >> >> sysctl debug.disablecwd=0 >> sysctl debug.disablefullpath=0 >> >> # mount -v -t nfs >> 192.168.168.1:/pool on /home (nfs, noatime, nfsv4acls, fsid >> 02ff003a3a000000) >> # ls /home | wc -l >> 4946 >> # cd /home/user6308/.ro >> # time pwd >> /home/user6308/.ro >> 0.008u 0.269s 0:08.47 3.0% 4+157k 0+0io 0pf+0w >> # ktrace -t+ -i pwd >> >> >> ktrace.out is big (1MB). Attach or not? >> >> >> >> A small piece of trace: >> 19527 pwd CALL >> mmap(0,0x400000,0x3,0x1002,0xffffffff,0) >> 19527 pwd RET mmap 34376515584/0x801000000 >> 19527 pwd CALL __getcwd(0x801006400,0x400) >> 19527 pwd NAMI ".." >> 19527 pwd NAMI ".." >> 19527 pwd RET __getcwd -1 errno 2 No such file or directory >> 19527 pwd CALL stat(0x800947a14,0x7fffffffd940) >> 19527 pwd NAMI "/" >> 19527 pwd STRU struct stat {dev=98, ino=2, mode=drwxr-xr-x , >> nlink=19, uid=0, gid=0, rdev=2120, atime=1363244893, stime=1362653279, >> ctime=1362653279, birthtime=1200836451, size=1024, blksize=16384, >> blocks=4, flags=0x0 } >> 19527 pwd RET stat 0 >> 19527 pwd CALL lstat(0x80094779c,0x7fffffffd940) >> 19527 pwd NAMI "." 
>> 19527 pwd STRU struct stat {dev=1230702064, ino=145, >> mode=drwxr-xr-x , nlink=2, uid=0, gid=0, rdev=4294967295, >> atime=1363244672.246785874, stime=1363244792.864201338, >> ctime=1363244792.864201338, birthtime=-1, size=3, blksize=4096, >> blocks=3, flags=0x0 } >> 19527 pwd RET lstat 0 >> 19527 pwd CALL openat(0xffffff9c,0x80094779b,0x100000,0x2) >> 19527 pwd NAMI ".." >> 19527 pwd RET openat 3 >> 19527 pwd CALL fstat(0x3,0x7fffffffd880) >> 19527 pwd STRU struct stat {dev=1230702064, ino=4, >> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, >> atime=1363244665.232140704, stime=1363010116.496298252, >> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, >> blocks=3, flags=0x0 } >> 19527 pwd RET fstat 0 >> 19527 pwd CALL fcntl(0x3,F_SETFD,FD_CLOEXEC) >> 19527 pwd RET fcntl 0 >> 19527 pwd CALL fstatfs(0x3,0x7fffffffd660) >> 19527 pwd RET fstatfs 0 >> 19527 pwd CALL fstat(0x3,0x7fffffffd940) >> 19527 pwd STRU struct stat {dev=1230702064, ino=4, >> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, >> atime=1363244665.232140704, stime=1363010116.496298252, >> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, >> blocks=3, flags=0x0 } >> 19527 pwd RET fstat 0 >> 19527 pwd CALL getdirentries(0x3,0x801018000,0x1000,0x8010160a8) >> 19527 pwd RET getdirentries 4096/0x1000 >> 19527 pwd CALL fstat(0x3,0x7fffffffd940) >> 19527 pwd STRU struct stat {dev=1230702064, ino=4, >> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, >> atime=1363244665.232140704, stime=1363010116.496298252, >> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, >> blocks=3, flags=0x0 } >> 19527 pwd RET fstat 0 >> 19527 pwd CALL openat(0x3,0x80094779b,0x100000,0) >> 19527 pwd NAMI ".." >> 19527 pwd RET openat 4 >> [..............................] 
>> 19527 pwd CALL madvise(0x801016000,0x1000,MADV_FREE) >> 19527 pwd RET madvise 0 >> 19527 pwd CALL madvise(0x801018000,0x2000,MADV_FREE) >> 19527 pwd RET madvise 0 >> 19527 pwd CALL close(0x3) >> 19527 pwd RET close 0 >> 19527 pwd CALL fstat(0x4,0x7fffffffd880) >> 19527 pwd STRU struct stat {dev=973143810, ino=4, >> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295, >> atime=1363244767.460164771, stime=1363172100.380266923, >> ctime=1363172100.380266923, birthtime=-1, size=4948, blksize=4096, >> blocks=713, flags=0x0 } >> 19527 pwd RET fstat 0 >> 19527 pwd CALL fcntl(0x4,F_SETFD,FD_CLOEXEC) >> 19527 pwd RET fcntl 0 >> 19527 pwd CALL fstatfs(0x4,0x7fffffffd660) >> 19527 pwd RET fstatfs 0 >> 19527 pwd CALL fstat(0x4,0x7fffffffd940) >> 19527 pwd STRU struct stat {dev=973143810, ino=4, >> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295, >> atime=1363244767.460164771, stime=1363172100.380266923, >> ctime=1363172100.380266923, birthtime=-1, size=4948, blksize=4096, >> blocks=713, flags=0x0 } >> 19527 pwd RET fstat 0 >> 19527 pwd CALL getdirentries(0x4,0x801018000,0x1000,0x8010160a8) >> 19527 pwd RET getdirentries 4096/0x1000 >> 19527 pwd CALL fstatat(0x4,0x801018030,0x7fffffffd940,0x200) >> 19527 pwd NAMI "user6158" >> 19527 pwd STRU struct stat {dev=1774902232, ino=4, >> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, >> atime=1363009687.040357529, stime=1363010116.496298252, >> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, >> blocks=3, flags=0x0 } >> 19527 pwd RET fstatat 0 >> 19527 pwd CALL fstatat(0x4,0x80101804c,0x7fffffffd940,0x200) >> 19527 pwd NAMI "user2289" >> 19527 pwd STRU struct stat {dev=1988229825, ino=4, >> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, >> atime=1363009687.040357529, stime=1363010116.496298252, >> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, >> blocks=3, flags=0x0 } >> 19527 pwd RET fstatat 0 >> 19527 pwd CALL fstatat(0x4,0x801018068,0x7fffffffd940,0x200) 
>> 19527 pwd NAMI "user4761" >> 19527 pwd STRU struct stat {dev=2438657130, ino=4, >> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, >> atime=1363009687.040357529, stime=1363010116.496298252, >> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, >> blocks=3, flags=0x0 } >> 19527 pwd RET fstatat 0 >> 19527 pwd CALL fstatat(0x4,0x801018084,0x7fffffffd940,0x200) >> 19527 pwd NAMI "user6055" >> [.........................................] >> >> and next get stat of all directories in /home > > Slightly different version of the patch was committed as r247560. > > The situation could only happen if the parent directory contains the "." > entry with inode number equal to the inode number of the subdirectory. > Can you confirm that this is your case ? > Yes, it is. I'll try again on the latest snapshot. Thanks! -- Best Regards, Ilia Noskov Regional Network Information Center (RU-CENTER) phone: +7 495 737-0601 fax: +7 495 737-0602 http://www.nic.ru From owner-freebsd-fs@FreeBSD.ORG Thu Mar 14 11:47:20 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 51F58599 for ; Thu, 14 Mar 2013 11:47:20 +0000 (UTC) (envelope-from peter.maloney@brockmann-consult.de) Received: from moutng.kundenserver.de (moutng.kundenserver.de [212.227.126.171]) by mx1.freebsd.org (Postfix) with ESMTP id D683D6DA for ; Thu, 14 Mar 2013 11:47:19 +0000 (UTC) Received: from [10.3.0.26] ([141.4.215.32]) by mrelayeu.kundenserver.de (node=mreu4) with ESMTP (Nemesis) id 0M31EZ-1UZ1BQ27yp-00sxHw; Thu, 14 Mar 2013 12:47:03 +0100 Message-ID: <5141B8B6.4010209@brockmann-consult.de> Date: Thu, 14 Mar 2013 12:47:02 +0100 From: Peter Maloney User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: Bruce Evans Subject: Re: Aligning MBR for ZFS boot help References: <513C1629.50501@caltel.com> <513CD9AB.5080903@caltel.com> 
<513CE369.4030303@caltel.com> <1362951595.99445.2.camel@btw.pki2.com> <513E1208.5020804@caltel.com> <20130312203745.A1130@besplex.bde.org> <513F8F04.60206@caltel.com> <20130313232247.B1078@besplex.bde.org> <5140F373.1010907@caltel.com> <20130314195715.Y909@besplex.bde.org> In-Reply-To: <20130314195715.Y909@besplex.bde.org> X-Enigmail-Version: 1.5 X-Provags-ID: V02:K0:C4/pe3TKhgPHEqVsWCFKHk7t77FrJqAAOpRdszHRylk Oze/q0yot56nEVdpSobrAswTpXLbeJBzppXtzLCFrn1ehj8H/k qU9v3JX0tM3hERgT7hq+5h3KBKTA0SBzPmsIxfgfoEU4Icm2VX BJNRTZLRCYCxX6cJl/6Bp0+fN8VqgEjIE9yekNQeqkUEOi9yS3 Cgb16UwtrnQfAvquYWQZXUY0MCQKhThWDA6RHm7pBayHHIRsYj aMFR9X1Jg3huY2fd9bLIBbB8oIKtIHpoISQQ0YuSpTNkCnhHh/ 0ZVXm6QMjKHfnLOd08C3UODvW02LmdXf8ZyzO9h1Om57tOpOoG ZSJZZ3+6rmR64lZEXgF7xDfIS7EPnMO/AgEz6qv1C Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: freebsd-fs@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Mar 2013 11:47:20 -0000 On 2013-03-14 10:41, Bruce Evans wrote: > On Wed, 13 Mar 2013, Cody Ritts wrote: > >> So, by setting those CHS values I am: >> making the partition table more compatible with other operating >> systems and BIOSes? >> and giving some utilities the CHS stuff they need to function right? > > It's not completely clear that S=32 H=64 is portable, but it is what most > old SCSI BIOSes used. > > Also, if the disk already has some partitions with a certain geometry, > use > the same geometry for other partitions and don't use fdisk's defaults if > they differ. > > Bruce Oh man... I thought yeah that -a 1 or -a 2048 should work, but it doesn't. And then I thought I'd be extra crafty and use dd to directly write the partition table myself and send that as a solution to you guys, but even that fails! 
Here's writing a 63 alignment mbr to the disk, just to prove dd can do this: # gdd if=mbr.img of=/dev/md10 bs=512 count=1 1+0 records in 1+0 records out 512 bytes (512 B) copied, 16.8709 s, 0.0 kB/s # gpart show md10 => 63 4194241 md10 MBR (2.0G) 63 40950 1 freebsd (20M) 41013 4153291 - free - (2G) Here's changing the start sector on the first partition to 2048 ;) Writing to the device works with bs=512, but not bs=1, so we use a file and bs=1 to do our edits, and then bs=512 to the disk. # gdd if=<(echo -ne "\x00\x08" ) of=mbr.img bs=1 seek=454 2+0 records in 2+0 records out 2 bytes (2 B) copied, 0.000112023 s, 17.9 kB/s Here's writing the new 2048 aligned mbr to the disk: # gdd if=mbr.img of=/dev/md10 bs=1 count=1 gdd: writing `/dev/md10': *Invalid argument* 1+0 records in 0+0 records out 0 bytes (0 B) copied, 21.0247 s, 0.0 kB/s :O From owner-freebsd-fs@FreeBSD.ORG Thu Mar 14 11:53:27 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id D3E3D8F5 for ; Thu, 14 Mar 2013 11:53:27 +0000 (UTC) (envelope-from peter.maloney@brockmann-consult.de) Received: from moutng.kundenserver.de (moutng.kundenserver.de [212.227.126.187]) by mx1.freebsd.org (Postfix) with ESMTP id 7A7CC73B for ; Thu, 14 Mar 2013 11:53:27 +0000 (UTC) Received: from [10.3.0.26] ([141.4.215.32]) by mrelayeu.kundenserver.de (node=mreu1) with ESMTP (Nemesis) id 0MURRH-1U7cHA1ROl-00R428; Thu, 14 Mar 2013 12:53:15 +0100 Message-ID: <5141BA2A.9080904@brockmann-consult.de> Date: Thu, 14 Mar 2013 12:53:14 +0100 From: Peter Maloney User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: Bruce Evans Subject: Re: Aligning MBR for ZFS boot help References: <513C1629.50501@caltel.com> <513CD9AB.5080903@caltel.com> <513CE369.4030303@caltel.com> <1362951595.99445.2.camel@btw.pki2.com> <513E1208.5020804@caltel.com> <20130312203745.A1130@besplex.bde.org> 
<513F8F04.60206@caltel.com> <20130313232247.B1078@besplex.bde.org> <5140F373.1010907@caltel.com> <20130314195715.Y909@besplex.bde.org> <5141B8B6.4010209@brockmann-consult.de> In-Reply-To: <5141B8B6.4010209@brockmann-consult.de> X-Enigmail-Version: 1.5 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Provags-ID: V02:K0:iidejYT6c3uT7Wc+/2P8aXqCpJesMuK8sHf15MFU3+Q AM5plOsrCOEiO2cpPkHan3AdFyGG4niJpP7GjP0o2Y8UgmeCTN 5VyhCA1Q7aUG32GO4ZEjF6LT8CH4FSW6LhhBhMycjQPKrqa2k7 /MGQ+5xcbjJBzgR+wD5Zi5ERfSwprFVB+/CV8STL+Jub2F0q88 P2l8Cu31wusKYNGvLY2BrvHlalVzZFhM6td1Z+S+Lu63oLM/Va 1YsVIiwOC0ebObryTVg1PiAjZWhhZ0rfbGxxlzk64GdfizYo6A 6HzdUqIIAbXTh59AfdYLXrkEziGTmiFWH6w0NKfI2kK1lMqhO4 KDnSf7TGn1aeMXpVObB0M4kcT1eod9RNiLCF3RYoP Cc: freebsd-fs@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Mar 2013 11:53:27 -0000 On 2013-03-14 12:47, Peter Maloney wrote: > On 2013-03-14 10:41, Bruce Evans wrote: >> On Wed, 13 Mar 2013, Cody Ritts wrote: >> >>> So, by setting those CHS values I am: >>> making the partition table more compatible with other operating >>> systems and BIOSes? >>> and giving some utilities the CHS stuff they need to function right? >> It's not completely clear that S=32 H=64 is portable, but it is what most >> old SCSI BIOSes used. >> >> Also, if the disk already has some partitions with a certain geometry, >> use >> the same geometry for other partitions and don't use fdisk's defaults if >> they differ. >> >> Bruce > Oh man... I thought yeah that -a 1 or -a 2048 should work, but it > doesn't. And then I thought I'd be extra crafty and use dd to directly > write the partition table myself and send that as a solution to you > guys, but even that fails! 
> > > Here's writing a 63 alignment mbr to the disk, just to prove dd can do this: > > # gdd if=mbr.img of=/dev/md10 bs=512 count=1 > 1+0 records in > 1+0 records out > 512 bytes (512 B) copied, 16.8709 s, 0.0 kB/s > > # gpart show md10 > => 63 4194241 md10 MBR (2.0G) > 63 40950 1 freebsd (20M) > 41013 4153291 - free - (2G) > > Here's changing the start sector on the first partition to 2048 ;) > Writing to the device works with bs=512, but not bs=1, so we use a file > and bs=1 to do our edits, and then bs=512 to the disk. > > # gdd if=<(echo -ne "\x00\x08" ) of=mbr.img bs=1 seek=454 > 2+0 records in > 2+0 records out > 2 bytes (2 B) copied, 0.000112023 s, 17.9 kB/s > > Here's writing the new 2048 aligned mbr to the disk: > > # gdd if=mbr.img of=/dev/md10 bs=1 count=1 > gdd: writing `/dev/md10': *Invalid argument* > 1+0 records in > 0+0 records out > 0 bytes (0 B) copied, 21.0247 s, 0.0 kB/s > > :O > _________________________________________ Oh, and I almost forgot the most important part... the solution! The solution is to align to 129024 sectors instead, which fits the needs of modern 512/1024/2048 alignment, and also the crazy old thing. # gpart add -t freebsd -a 129024 -s 1M md10 md10s1 added # gpart add -t freebsd -a 129024 -s 1511M md10 md10s2 added # gpart show md10 => 63 4194241 md10 MBR (2.0G) 63 128961 - free - (63M) 129024 2016 1 freebsd (1M) 131040 127008 - free - (62M) 258048 2967552 2 freebsd (1.4G) 3225600 968704 - free - (473M) Above -s numbers are basically random for testing. So now let's check that they are indeed aligned, with the modulus in bc. (Note that strangely, % is only modulus if scale=0 in bc) # bc bc 1.06 Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc. This is free software with ABSOLUTELY NO WARRANTY. For details type `warranty'. 
129024%2048 0 258048%2048 0 From owner-freebsd-fs@FreeBSD.ORG Thu Mar 14 14:57:29 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id A20B7799; Thu, 14 Mar 2013 14:57:29 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id 70A2F266; Thu, 14 Mar 2013 14:57:29 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id C4858B985; Thu, 14 Mar 2013 10:57:28 -0400 (EDT) From: John Baldwin To: Konstantin Belousov Subject: Re: Deadlock in the NFS client Date: Thu, 14 Mar 2013 10:57:13 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p25; KDE/4.5.5; amd64; ; ) References: <201303131356.37919.jhb@freebsd.org> <492562517.3880600.1363217615412.JavaMail.root@erie.cs.uoguelph.ca> <20130314092728.GI3794@kib.kiev.ua> In-Reply-To: <20130314092728.GI3794@kib.kiev.ua> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Message-Id: <201303141057.13609.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Thu, 14 Mar 2013 10:57:28 -0400 (EDT) Cc: Rick Macklem , fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Mar 2013 14:57:29 -0000 On Thursday, March 14, 2013 5:27:28 am Konstantin Belousov wrote: > On Wed, Mar 13, 2013 at 07:33:35PM -0400, Rick Macklem wrote: > > John Baldwin wrote: > > > I ran into a machine that had a deadlock among certain files on a > > > given NFS > > > mount today. 
I'm not sure how best to resolve it, though it seems like > > > perhaps there is a bug with how the pool of nfsiod threads is managed. > > > Anyway, more details on the actual hang below. This was on 8.x with > > > the > > > old NFS client, but I don't see anything in HEAD that would fix this. > > > > > > First note that the system was idle so it had dropped down to only one > > > nfsiod thread. > > > > > Hmm, I see the problem and I'm a bit surprised it doesn't bite more often. > > It seems to me that this snippet of code from nfs_asyncio() makes too > > weak an assumption: > > /* > > * If none are free, we may already have an iod working on this mount > > * point. If so, it will process our request. > > */ > > if (!gotiod) { > > if (nmp->nm_bufqiods > 0) { > > NFS_DPF(ASYNCIO, > > ("nfs_asyncio: %d iods are already processing mount %p\n", > > nmp->nm_bufqiods, nmp)); > > gotiod = TRUE; > > } > > } > > It assumes that, since an nfsiod thread is processing some buffer for the > > mount, it will become available to do this one, which isn't true for your > > deadlock. > > > > I think the simple fix would be to recode nfs_asyncio() so that > > it only returns 0 if it finds an AVAILABLE nfsiod thread that it > > has assigned to do the I/O, getting rid of the above. The problem > > with doing this is that it may result in a lot more synchronous I/O > > (nfs_asyncio() returns EIO, so the caller does the I/O). Maybe more > > synchronous I/O could be avoided by allowing nfs_asyncio() to create a > > new thread even if the total is above nfs_iodmax. (I think this would > > require the fixed array to be replaced with a linked list and might > > result in a large number of nfsiod threads.) Maybe just having a large > > nfs_iodmax would be an adequate compromise? > > > > Does having a large # of nfsiod threads cause any serious problem for > > most systems these days? 
> > > > I'd be tempted to recode nfs_asyncio() as above and then, instead > > of nfs_iodmin and nfs_iodmax, I'd simply have: - a fixed number of > > nfsiod threads (this could be a tunable, with the understanding that > > it should be large for good performance) > > > > I do not see how this would solve the deadlock itself. The proposal would > only allow the system to survive slightly longer after the deadlock appeared. > And, I think that allowing the unbounded number of nfsiod threads is also > fatal. > > The issue there is the LOR between buffer lock and vnode lock. Buffer lock > always must come after the vnode lock. The problematic nfsiod thread, which > locks the vnode, violates this rule, because despite the LK_KERNPROC > ownership of the buffer lock, it is the thread which de facto owns the > buffer (only the thread can unlock it). > > A possible solution would be to pass LK_NOWAIT to nfs_nget() from the > nfs_readdirplusrpc(). From my reading of the code, nfs_nget() should > be capable of correctly handling the lock failure. And EBUSY would > result in doit = 0, which should be fine too. > > It is possible that EBUSY should be reset to 0, though. Yes, thinking about this more, I do think the right answer is for readdirplus to do this. The only question I have is if it should do this always, or if it should do this only from the nfsiod thread. I believe you can't get this in the non-nfsiod case. 
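The try-lock idea discussed here can be illustrated outside the kernel. The following is a user-space Python analogy only, not FreeBSD code: `vnode_lock`, `buffer_lock` and `readdirplus_entry` are hypothetical stand-ins, and `acquire(blocking=False)` plays the role that LK_NOWAIT would play for nfs_nget(). A thread that already holds the buffer lock never sleeps on the vnode lock; if the vnode is busy it simply skips the optional per-entry work, which corresponds to doit = 0.

```python
import threading

vnode_lock = threading.Lock()   # hypothetical stand-in for a vnode lock
buffer_lock = threading.Lock()  # hypothetical stand-in for a buffer lock


def readdirplus_entry(load_attrs):
    """Meant to be called with buffer_lock held (the reversed lock order).

    Tries the vnode lock without sleeping. Returns True if the optional
    attribute work ran, False if the vnode was busy and the work was skipped.
    """
    if not vnode_lock.acquire(blocking=False):  # analogous to LK_NOWAIT failing with EBUSY
        return False                            # analogous to doit = 0; not treated as an error
    try:
        load_attrs()
        return True
    finally:
        vnode_lock.release()


done = []
with buffer_lock:
    # Vnode lock is free here, so the optional work runs.
    print(readdirplus_entry(lambda: done.append("attrs loaded")))  # True
```

Because the second lock is never waited for while the first is held, the reversed acquisition order cannot turn into a deadlock; the cost is that the optional work is sometimes skipped.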
-- John Baldwin From owner-freebsd-fs@FreeBSD.ORG Thu Mar 14 17:22:49 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 63914CEF; Thu, 14 Mar 2013 17:22:49 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id C9DA1DD0; Thu, 14 Mar 2013 17:22:48 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.6/8.14.6) with ESMTP id r2EHMegE075450; Thu, 14 Mar 2013 19:22:40 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.8.0 kib.kiev.ua r2EHMegE075450 Received: (from kostik@localhost) by tom.home (8.14.6/8.14.6/Submit) id r2EHMeDj075449; Thu, 14 Mar 2013 19:22:40 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 14 Mar 2013 19:22:39 +0200 From: Konstantin Belousov To: John Baldwin Subject: Re: Deadlock in the NFS client Message-ID: <20130314172239.GL3794@kib.kiev.ua> References: <201303131356.37919.jhb@freebsd.org> <492562517.3880600.1363217615412.JavaMail.root@erie.cs.uoguelph.ca> <20130314092728.GI3794@kib.kiev.ua> <201303141057.13609.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="f78grIEC2xXSYamv" Content-Disposition: inline In-Reply-To: <201303141057.13609.jhb@freebsd.org> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: Rick Macklem , fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: 
, X-List-Received-Date: Thu, 14 Mar 2013 17:22:49 -0000 --f78grIEC2xXSYamv Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Mar 14, 2013 at 10:57:13AM -0400, John Baldwin wrote: > On Thursday, March 14, 2013 5:27:28 am Konstantin Belousov wrote: > > On Wed, Mar 13, 2013 at 07:33:35PM -0400, Rick Macklem wrote: > > > John Baldwin wrote: > > > > I ran into a machine that had a deadlock among certain files on a > > > > given NFS > > > > mount today. I'm not sure how best to resolve it, though it seems like > > > > perhaps there is a bug with how the pool of nfsiod threads is managed. > > > > Anyway, more details on the actual hang below. This was on 8.x with > > > > the > > > > old NFS client, but I don't see anything in HEAD that would fix this. > > > > > > > > First note that the system was idle so it had dropped down to only one > > > > nfsiod thread. > > > > > > > Hmm, I see the problem and I'm a bit surprised it doesn't bite more often. > > > It seems to me that this snippet of code from nfs_asyncio() makes too > > > weak an assumption: > > > /* > > > * If none are free, we may already have an iod working on this mount > > > * point. If so, it will process our request. > > > */ > > > if (!gotiod) { > > > if (nmp->nm_bufqiods > 0) { > > > NFS_DPF(ASYNCIO, > > > ("nfs_asyncio: %d iods are already processing mount %p\n", > > > nmp->nm_bufqiods, nmp)); > > > gotiod = TRUE; > > > } > > > } > > > It assumes that, since an nfsiod thread is processing some buffer for the > > > mount, it will become available to do this one, which isn't true for your > > > deadlock. > > > > > > I think the simple fix would be to recode nfs_asyncio() so that > > > it only returns 0 if it finds an AVAILABLE nfsiod thread that it > > > has assigned to do the I/O, getting rid of the above. 
The problem > > > with doing this is that it may result in a lot more synchronous I/O > > > (nfs_asyncio() returns EIO, so the caller does the I/O). Maybe more > > > synchronous I/O could be avoided by allowing nfs_asyncio() to create a > > > new thread even if the total is above nfs_iodmax. (I think this would > > > require the fixed array to be replaced with a linked list and might > > > result in a large number of nfsiod threads.) Maybe just having a large > > > nfs_iodmax would be an adequate compromise? > > > > > > Does having a large # of nfsiod threads cause any serious problem for > > > most systems these days? > > > > > > I'd be tempted to recode nfs_asyncio() as above and then, instead > > > of nfs_iodmin and nfs_iodmax, I'd simply have: - a fixed number of > > > nfsiod threads (this could be a tunable, with the understanding that > > > it should be large for good performance) > > > > > > > I do not see how this would solve the deadlock itself. The proposal would > > only allow the system to survive slightly longer after the deadlock appeared. > > And, I think that allowing the unbounded number of nfsiod threads is also > > fatal. > > > > The issue there is the LOR between buffer lock and vnode lock. Buffer lock > > always must come after the vnode lock. The problematic nfsiod thread, which > > locks the vnode, violates this rule, because despite the LK_KERNPROC > > ownership of the buffer lock, it is the thread which de facto owns the > > buffer (only the thread can unlock it). > > > > A possible solution would be to pass LK_NOWAIT to nfs_nget() from the > > nfs_readdirplusrpc(). From my reading of the code, nfs_nget() should > > be capable of correctly handling the lock failure. And EBUSY would > > result in doit = 0, which should be fine too. > > > > It is possible that EBUSY should be reset to 0, though. > > Yes, thinking about this more, I do think the right answer is for > readdirplus to do this. 
The only question I have is if it should do > this always, or if it should do this only from the nfsiod thread. I > believe you can't get this in the non-nfsiod case. I agree that it looks as of the workaround only needed for nfsiod thread. On the other hand, it is not immediately obvious how to detect that the current thread is nfsio daemon. Probably a thread flag should be set. --f78grIEC2xXSYamv Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iQIcBAEBAgAGBQJRQgdfAAoJEJDCuSvBvK1BJ3AQAJv801LkZRl8NG2tgvt1jm41 o0IcdC0+frAC6yRZQsmwkINJEFhojf5cozWobyBOYzjoC8SKJIJ4JfC82b7tZIgz 3KnRd0cspkuTrM5WbAkKL/MHzZmHAZkNs4VJ2z/Ov+pqedlq+HlecYbH9PUxG7+e ZFSVTIhSP17pHeLFamR4eVKuJxC9H723ca//h5tgAWHu/PBimHOWdT6URJ6C1N/A QuTZqkKSJ8HNkO+DPU89h5wC1IpDXwni+YY5M5rfbc9eisogeQ3k3KW4Jv28oDEe VpPIhSLRZwF/nL/0tn0ha1s62XnAyFYT3r7j1pnFTRssdR0/llB8A3y9vRjkXsaX uRoKholt57JZ7NsbR+yE8CrWQxBePx5cTaxU7k/42eqwm0JGPMiHaq8DChGQuC/m tWaxtva48A0jL37ND+w/mifl/Bmul1s+U6VNwuZ732SQyHCTabTqV7t5InPbRL12 JTc+OOui85MU3wvigVUCKxenp181Hx1No+QXKFVeesUU1NENZWlph5+DxTy/UynB G0v73kO70q1Rs6hQfQPRburyO2t6TdDVTfzpJNeoFR7mPfQ77uRqPWMxaQD6wlPw v7ZZqH2H1adJltP5tzgxGuRDSmjRxcnKvCgICCLlK3n/JneLSmiFDOeOwRqf7G0g 1OA/JWbKedASOThXMHrD =P2wk -----END PGP SIGNATURE----- --f78grIEC2xXSYamv-- From owner-freebsd-fs@FreeBSD.ORG Thu Mar 14 18:13:40 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 0189BD39 for ; Thu, 14 Mar 2013 18:13:40 +0000 (UTC) (envelope-from fjwcash@gmail.com) Received: from mail-qa0-f49.google.com (mail-qa0-f49.google.com [209.85.216.49]) by mx1.freebsd.org (Postfix) with ESMTP id BCBBB172 for ; Thu, 14 Mar 2013 18:13:39 +0000 (UTC) Received: by mail-qa0-f49.google.com with SMTP id o13so1399854qaj.15 for ; Thu, 14 Mar 2013 11:13:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; 
h=mime-version:x-received:date:message-id:subject:from:to :content-type; bh=GYplxzYfiI41tlYW79pzop4H+7jE/XGvwwwev1BwNtU=; b=IZDNwfAY+G0fzdBtGAG2Kf/ZQc238NtfwAyalEwiABkUObRexaue6b0maSH1e0y8fl CMfT/hfF70AFafIFfwZAN5nUb/FntherTWcnM5Mucu4G63NNQD/95CY67CS0FEBzh9n0 8U1WFok6nvj2rQFJ7u4Kq6w8tz8Q2OM/goyE9VwB2BrZenSM/4A3TY2Wqz0FHt0Onc9z ZJR3Y/LNPoPXSw2ST5OjoBor5g0HUrXngYS5VNXEqtiifeOFPc2c6ePIBAn0fPl/h6zm 9npvQVjOIxuZexn+Azn6zqPaPWocLestS42hZ4SF+NPtSvG7CApe8j+jXxJ5M4nwQoql Xotw== MIME-Version: 1.0 X-Received: by 10.229.172.162 with SMTP id l34mr713340qcz.81.1363284818828; Thu, 14 Mar 2013 11:13:38 -0700 (PDT) Received: by 10.49.50.67 with HTTP; Thu, 14 Mar 2013 11:13:38 -0700 (PDT) Date: Thu, 14 Mar 2013 11:13:38 -0700 Message-ID: Subject: Strange slowdown when cache devices enabled in ZFS From: Freddie Cash To: FreeBSD Filesystems Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Mar 2013 18:13:40 -0000 3 storage systems are running this: # uname -a FreeBSD alphadrive.sd73.bc.ca 9.1-STABLE FreeBSD 9.1-STABLE #0 r245466M: Fri Feb 1 09:38:24 PST 2013 root@alphadrive.sd73.bc.ca:/usr/obj/usr/src/sys/ZFSHOST amd64 1 storage system is running this: # uname -a FreeBSD omegadrive.sd73.bc.ca 9.1-STABLE FreeBSD 9.1-STABLE #0 r247804M: Mon Mar 4 10:27:26 PST 2013 root@omegadrive.sd73.bc.ca:/usr/obj/usr/src/sys/ZFSHOST amd64 The last system has manually merged the ZFS "deadman" patch (r 247265 from -CURRENT). All 4 systems exhibit the same symptoms: if a cache device is enabled in the pool, the l2arc_feed_thread of zfskern will spin until it takes up 100% of a CPU core, at which point all I/O to the pool stops. "zpool iostat 1" and "zpool iostat -v 1" show 0 reads and 0 writes to the pool. "gstat -I 1s -f gpt" shows 0 activity to the pool disks. 
If I remove the cache device from the pool, I/O starts up right away (although it takes several minutes for the remove operation to complete). During the "0 I/O period", any attempt to access the pool "hangs". CTRL+T shows either spa_namespace_lock or tx->tx_something or other (the one when trying to write a transaction to disk). And it will stay like that until the cache device is removed. Hardware is almost the same in all 4 boxes: 3x storage boxes: alphadrive: SuperMicro H8DGi-F motherboard AMD Opteron 6128 CPU (8 cores at 2.0 GHz) 64 GB of DDR3 ECC SDRAM in one box 32 GB SSD for the OS and cache device (GPT partitioned) 24x 2.0 TB WD and Seagate SATA harddrives (4x 6-drive raidz2 vdevs) SuperMicro AOC-USAS-8i SATA controller using mpt driver SuperMicro 4U chassis betadrive: SuperMicro H8DGi-F motherboard AMD Opteron 6128 CPU (8 cores at 2.0 GHz) 48 GB of DDR3 ECC SDRAM in one box 32 GB SSD for the OS and cache device (GPT partitioned) 16x 2.0 TB WD and Seagate SATA harddrives (3x 5-drive raidz2 vdevs + spare) SuperMicro AOC-USAS2-8i SATA controller using mps driver SuperMicro 3U chassis zuludrive: SuperMicro H8DGi-F motherboard AMD Opteron 6128 CPU (8 cores at 2.0 GHz) 32 GB of DDR3 ECC SDRAM in one box 32 GB SSD for the OS and cache device (GPT partitioned) 24x 2.0 TB WD and Seagate SATA harddrives (4x 6-drive raidz2 vdevs) SuperMicro AOC-USAS2-8i SATA controller using mps driver SuperMicro 836 chassis 1x storage box: omegadrive: SuperMicro H8DG6-F motherboard 2x AMD Opteron 6128 CPU (8 cores at 2.0 GHz; 16 cores total) 128 GB of DDR3 ECC SDRAM in one box 2x 60 GB SSD for the OS (gmirror'd) and log devices (ZFS mirror) 2x 120 GB SSD for cache devices 45x 2.0 TB WD and Seagate SATA harddrives (7x 6-drive raidz2 vdevs + 3 spares) LSI 9211-8e SAS controllers using mps driver Onboard LSI 2008 SATA controller using mps driver for OS/log/cache SuperMicro 4U JBOD chassis SuperMicro 2U chassis for motherboard/OS alphadrive, betadrive, and omegadrive all have dedup and 
lzjb compression enabled. zuludrive has lzjb compression enabled (no dedup). alpha/beta/zulu do rsync backups every night from various local and remote Linux and FreeBSD boxes, then ZFS send the snapshot to omegadrive during the day. The "0 I/O periods" occur most often and most quickly on omegadrive when receiving snapshots, but will eventually occur on all systems during the rsyncs. Things I've tried: - limiting ARC to only 32 GB on each system - limiting L2ARC to 30 GB on each system - enabling the "deadman" patch in case it was I/O requests being lost by the drives/controllers - changing primarycache between all and metadata - increasing arc_meta_limit to just shy of arc_max - removing cache devices completely So far, only the last option works. Without L2ARC, the systems are 100% stable, and can push 200 MB/s of rsync writes and just shy of 500 MB/s of ZFS recv (saturates gigabit link, bursts writes; usually hovers around 50-80 MB/s continuous writes). I'm baffled. An L2ARC is supposed to make things faster, especially when using dedup as the DDT can be cached. 
--
Freddie Cash
fjwcash@gmail.com

From owner-freebsd-fs@FreeBSD.ORG Thu Mar 14 18:44:56 2013
Return-Path:
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id C33DB47E for ; Thu, 14 Mar 2013 18:44:56 +0000 (UTC) (envelope-from daniel@digsys.bg)
Received: from smtp-sofia.digsys.bg (smtp-sofia.digsys.bg [193.68.21.123]) by mx1.freebsd.org (Postfix) with ESMTP id 183002B3 for ; Thu, 14 Mar 2013 18:44:55 +0000 (UTC)
Received: from [192.168.178.221] ([62.28.165.86]) (authenticated bits=0) by smtp-sofia.digsys.bg (8.14.6/8.14.6) with ESMTP id r2EIfxsq091921 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NO); Thu, 14 Mar 2013 20:42:02 +0200 (EET) (envelope-from daniel@digsys.bg)
References:
Mime-Version: 1.0 (1.0)
In-Reply-To:
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
Message-Id: <26456299-66A3-4CCF-9B9A-906D47EFFC93@digsys.bg>
X-Mailer: iPad Mail (10B146)
From: Daniel Kalchev
Subject: Re: Strange slowdown when cache devices enabled in ZFS
Date: Thu, 14 Mar 2013 18:41:58 +0000
To: Freddie Cash
Cc: FreeBSD Filesystems
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Thu, 14 Mar 2013 18:44:56 -0000

Just an idea - have you tried to increase the L2ARC fill rate? That might
move data faster from RAM to L2ARC.. Don't remember the says to offhand.

Might be dedup kicking in and not able to get "swapped out" fast enough.

Daniel

From owner-freebsd-fs@FreeBSD.ORG Thu Mar 14 18:51:41 2013
Return-Path:
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 78193919 for ; Thu, 14 Mar 2013 18:51:41 +0000 (UTC) (envelope-from fjwcash@gmail.com)
Received: from mail-qc0-x235.google.com (mail-qc0-x235.google.com [IPv6:2607:f8b0:400d:c01::235]) by mx1.freebsd.org (Postfix) with ESMTP id 3CB8B340 for ; Thu, 14 Mar 2013 18:51:41 +0000 (UTC)
Received: by mail-qc0-f181.google.com with SMTP id a22so1190787qcs.12 for ; Thu, 14 Mar 2013 11:51:40 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=NaEI9E5BSMcb4Dd2FVAfzktgk2sA+P7/cjvLmeDxVzM=; b=WjY7tSBYPWjDxE6bJMOWGG9G3nwcHPIAO9xnIl09cvK+6Zdgi+13rAK21apzHuwrYc JqRpKcO9qIKDzUYbhoQ3ZG62j8dwnAttvzUw3u982ZgaPDrNAoK3Vb1OJuBnPlOTXolc Jyf13lkv8FLKGczxR1mdGC7EI046XiiIr2TCA8nBOxQpwDxmYLVrd3rW/TLyDla1Oa0u zoQoC02uRGljKREUEyi/QfULTYalZMDpkAjA46swY2WSQE8tmfPskjwdUjXHmA9TEyah 4jLkjCOHqDrZvQWr5t5GOJHJjxHZnbPtzm3NBFm/jVysty3riRjJY4DOxHbjfvk5OJz4 UWyA==
MIME-Version: 1.0
X-Received: by 10.224.182.70 with SMTP id cb6mr3622266qab.80.1363287100731; Thu, 14 Mar 2013 11:51:40 -0700 (PDT)
Received: by 10.49.50.67 with HTTP; Thu, 14 Mar 2013 11:51:40 -0700 (PDT)
In-Reply-To: <26456299-66A3-4CCF-9B9A-906D47EFFC93@digsys.bg>
References: <26456299-66A3-4CCF-9B9A-906D47EFFC93@digsys.bg>
Date: Thu, 14 Mar 2013 11:51:40 -0700
Message-ID:
Subject: Re: Strange slowdown when cache devices enabled in ZFS
From: Freddie Cash
To: Daniel Kalchev
Content-Type: text/plain; charset=UTF-8
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: FreeBSD Filesystems
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Thu, 14 Mar 2013 18:51:41 -0000 I do have the following set in /boot/loader.conf: vfs.zfs.l2arc_write_boost="160000000" # Set the L2ARC warmup writes to 160 MBps vfs.zfs.l2arc_write_max="320000000" # Set the L2ARC writes to 320 MBps Haven't tried setting them any higher than that, though. During the "0 I/O periods", there's no I/O going to the cache devices either, and they're anywhere from 50% full to almost 100% full (as shown in "zpool list -v" and "zpool iostat" output). ARC use is close to max, though. On Thu, Mar 14, 2013 at 11:41 AM, Daniel Kalchev wrote: > Just an idea - have you tried to increase the L2ARC fill rate? That might > move data faster from RAM to L2ARC.. Don't remember the says to offhand. > > Might be dedup kicking in and not able to get "swapped out" fast enough. > > Daniel -- Freddie Cash fjwcash@gmail.com From owner-freebsd-fs@FreeBSD.ORG Thu Mar 14 20:35:40 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id A1D262FB; Thu, 14 Mar 2013 20:35:40 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id 4DE98A11; Thu, 14 Mar 2013 20:35:40 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id A4A3EB943; Thu, 14 Mar 2013 16:35:39 -0400 (EDT) From: John Baldwin To: Konstantin Belousov Subject: Re: Deadlock in the NFS client Date: Thu, 14 Mar 2013 14:44:35 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p25; KDE/4.5.5; amd64; ; ) References: <201303131356.37919.jhb@freebsd.org> <201303141057.13609.jhb@freebsd.org> <20130314172239.GL3794@kib.kiev.ua> In-Reply-To: <20130314172239.GL3794@kib.kiev.ua> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit 
Message-Id: <201303141444.35740.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Thu, 14 Mar 2013 16:35:39 -0400 (EDT) Cc: Rick Macklem , fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Mar 2013 20:35:40 -0000 On Thursday, March 14, 2013 1:22:39 pm Konstantin Belousov wrote: > On Thu, Mar 14, 2013 at 10:57:13AM -0400, John Baldwin wrote: > > On Thursday, March 14, 2013 5:27:28 am Konstantin Belousov wrote: > > > On Wed, Mar 13, 2013 at 07:33:35PM -0400, Rick Macklem wrote: > > > > John Baldwin wrote: > > > > > I ran into a machine that had a deadlock among certain files on a > > > > > given NFS > > > > > mount today. I'm not sure how best to resolve it, though it seems like > > > > > perhaps there is a bug with how the pool of nfsiod threads is managed. > > > > > Anyway, more details on the actual hang below. This was on 8.x with > > > > > the > > > > > old NFS client, but I don't see anything in HEAD that would fix this. > > > > > > > > > > First note that the system was idle so it had dropped down to only one > > > > > nfsiod thread. > > > > > > > > > Hmm, I see the problem and I'm a bit surprised it doesn't bite more often. > > > > It seems to me that this snippet of code from nfs_asyncio() makes too > > > > weak an assumption: > > > > /* > > > > * If none are free, we may already have an iod working on this mount > > > > * point. If so, it will process our request. 
> > > > */ > > > > if (!gotiod) { > > > > if (nmp->nm_bufqiods > 0) { > > > > NFS_DPF(ASYNCIO, > > > > ("nfs_asyncio: %d iods are already processing mount %p\n", > > > > nmp->nm_bufqiods, nmp)); > > > > gotiod = TRUE; > > > > } > > > > } > > > > It assumes that, since an nfsiod thread is processing some buffer for the > > > > mount, it will become available to do this one, which isn't true for your > > > > deadlock. > > > > > > > > I think the simple fix would be to recode nfs_asyncio() so that > > > > it only returns 0 if it finds an AVAILABLE nfsiod thread that it > > > > has assigned to do the I/O, getting rid of the above. The problem > > > > with doing this is that it may result in a lot more synchronous I/O > > > > (nfs_asyncio() returns EIO, so the caller does the I/O). Maybe more > > > > synchronous I/O could be avoided by allowing nfs_asyncio() to create a > > > > new thread even if the total is above nfs_iodmax. (I think this would > > > > require the fixed array to be replaced with a linked list and might > > > > result in a large number of nfsiod threads.) Maybe just having a large > > > > nfs_iodmax would be an adequate compromise? > > > > > > > > Does having a large # of nfsiod threads cause any serious problem for > > > > most systems these days? > > > > > > > > I'd be tempted to recode nfs_asyncio() as above and then, instead > > > > of nfs_iodmin and nfs_iodmax, I'd simply have: - a fixed number of > > > > nfsiod threads (this could be a tunable, with the understanding that > > > > it should be large for good performance) > > > > > > > > > > I do not see how this would solve the deadlock itself. The proposal would > > > only allow system to survive slightly longer after the deadlock appeared. > > > And, I think that allowing the unbound amount of nfsiod threads is also > > > fatal. > > > > > > The issue there is the LOR between buffer lock and vnode lock. Buffer lock > > > always must come after the vnode lock. 
The problematic nfsiod thread, which
> > > locks the vnode, violates this rule, because despite the LK_KERNPROC
> > > ownership of the buffer lock, it is the thread which de facto owns the
> > > buffer (only the thread can unlock it).
> > >
> > > A possible solution would be to pass LK_NOWAIT to nfs_nget() from the
> > > nfs_readdirplusrpc(). From my reading of the code, nfs_nget() should
> > > be capable of correctly handling the lock failure. And EBUSY would
> > > result in doit = 0, which should be fine too.
> > >
> > > It is possible that EBUSY should be reset to 0, though.
> >
> > Yes, thinking about this more, I do think the right answer is for
> > readdirplus to do this. The only question I have is if it should do
> > this always, or if it should do this only from the nfsiod thread. I
> > believe you can't get this in the non-nfsiod case.
>
> I agree that it looks as if the workaround is only needed for the nfsiod thread.
> On the other hand, it is not immediately obvious how to detect that
> the current thread is an nfsiod daemon. Probably a thread flag should be
> set.

OTOH, updating the attributes from readdir+ is only an optimization anyway, so
just having it always do LK_NOWAIT is probably ok (and simple). Currently I'm
trying to develop a test case to provoke this so I can test the fix, but no
luck on that yet.
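
A rough userland sketch of the non-blocking idea discussed above may help
picture it. It is only an illustration under stated assumptions: a C11
atomic_flag stands in for the vnode lock, and the names fake_vnode,
fake_vn_trylock(), fake_vn_unlock() and refresh_entry_attrs() are
hypothetical, not the kernel's nfs_nget() or nfs_readdirplusrpc(). The point
is that a thread which may already own a buffer never sleeps on the second
lock, and that a busy lock skips the optional attribute update rather than
failing the readdir:

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a vnode; not the kernel's struct vnode. */
struct fake_vnode {
	atomic_flag locked;	/* set while some thread holds the "vnode lock" */
	int attr_updates;	/* number of attribute refreshes applied */
};

/* Non-blocking acquire, modelled on the LK_NOWAIT idea: it never sleeps. */
static int
fake_vn_trylock(struct fake_vnode *vp)
{
	/* test_and_set returns the previous value: true means already held. */
	return (atomic_flag_test_and_set(&vp->locked) ? EBUSY : 0);
}

static void
fake_vn_unlock(struct fake_vnode *vp)
{
	atomic_flag_clear(&vp->locked);
}

/*
 * Optional attribute refresh for one directory entry.
 * Returns 1 if the attributes were refreshed, 0 if the entry was skipped
 * because the lock was busy. The skip is deliberately not an error.
 */
static int
refresh_entry_attrs(struct fake_vnode *vp)
{
	if (fake_vn_trylock(vp) == EBUSY)
		return (0);	/* like doit = 0: drop only the optimization */
	vp->attr_updates++;
	fake_vn_unlock(vp);
	return (1);
}
```

A caller that gets 0 back simply leaves the cached attributes alone, which
fits the observation that the readdir+ attribute update is only an
optimization.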
-- John Baldwin From owner-freebsd-fs@FreeBSD.ORG Thu Mar 14 22:39:00 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 4D66E529; Thu, 14 Mar 2013 22:39:00 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id B326BF7E; Thu, 14 Mar 2013 22:38:59 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqAEADxRQlGDaFvO/2dsb2JhbABDiC65eYJegX10gisBAQUjBFIbDgoRGQIEVQYuh3mvO5JVjVWBDRkbB4ItgRMDjzaDXYNFkQKDJiCBNzU X-IronPort-AV: E=Sophos;i="4.84,848,1355115600"; d="scan'208";a="21301102" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 14 Mar 2013 18:38:49 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id F0BC2B3F1B; Thu, 14 Mar 2013 18:38:48 -0400 (EDT) Date: Thu, 14 Mar 2013 18:38:48 -0400 (EDT) From: Rick Macklem To: John Baldwin Message-ID: <341067813.3923482.1363300728967.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <201303141444.35740.jhb@freebsd.org> Subject: Re: Deadlock in the NFS client MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="----=_Part_3923481_469795532.1363300728964" X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692) Cc: Rick Macklem , fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Mar 2013 22:39:00 -0000 ------=_Part_3923481_469795532.1363300728964 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit John Baldwin wrote: > On Thursday, March 14, 2013 1:22:39 pm Konstantin Belousov 
wrote: > > On Thu, Mar 14, 2013 at 10:57:13AM -0400, John Baldwin wrote: > > > On Thursday, March 14, 2013 5:27:28 am Konstantin Belousov wrote: > > > > On Wed, Mar 13, 2013 at 07:33:35PM -0400, Rick Macklem wrote: > > > > > John Baldwin wrote: > > > > > > I ran into a machine that had a deadlock among certain files > > > > > > on a > > > > > > given NFS > > > > > > mount today. I'm not sure how best to resolve it, though it > > > > > > seems like > > > > > > perhaps there is a bug with how the pool of nfsiod threads > > > > > > is managed. > > > > > > Anyway, more details on the actual hang below. This was on > > > > > > 8.x with > > > > > > the > > > > > > old NFS client, but I don't see anything in HEAD that would > > > > > > fix this. > > > > > > > > > > > > First note that the system was idle so it had dropped down > > > > > > to only one > > > > > > nfsiod thread. > > > > > > > > > > > Hmm, I see the problem and I'm a bit surprised it doesn't bite > > > > > more often. > > > > > It seems to me that this snippet of code from nfs_asyncio() > > > > > makes too > > > > > weak an assumption: > > > > > /* > > > > > * If none are free, we may already have an iod working on > > > > > this mount > > > > > * point. If so, it will process our request. > > > > > */ > > > > > if (!gotiod) { > > > > > if (nmp->nm_bufqiods > 0) { > > > > > NFS_DPF(ASYNCIO, > > > > > ("nfs_asyncio: %d iods are already processing mount %p\n", > > > > > nmp->nm_bufqiods, nmp)); > > > > > gotiod = TRUE; > > > > > } > > > > > } > > > > > It assumes that, since an nfsiod thread is processing some > > > > > buffer for the > > > > > mount, it will become available to do this one, which isn't > > > > > true for your > > > > > deadlock. > > > > > > > > > > I think the simple fix would be to recode nfs_asyncio() so > > > > > that > > > > > it only returns 0 if it finds an AVAILABLE nfsiod thread that > > > > > it > > > > > has assigned to do the I/O, getting rid of the above. 
The > > > > > problem > > > > > with doing this is that it may result in a lot more > > > > > synchronous I/O > > > > > (nfs_asyncio() returns EIO, so the caller does the I/O). Maybe > > > > > more > > > > > synchronous I/O could be avoided by allowing nfs_asyncio() to > > > > > create a > > > > > new thread even if the total is above nfs_iodmax. (I think > > > > > this would > > > > > require the fixed array to be replaced with a linked list and > > > > > might > > > > > result in a large number of nfsiod threads.) Maybe just having > > > > > a large > > > > > nfs_iodmax would be an adequate compromise? > > > > > > > > > > Does having a large # of nfsiod threads cause any serious > > > > > problem for > > > > > most systems these days? > > > > > > > > > > I'd be tempted to recode nfs_asyncio() as above and then, > > > > > instead > > > > > of nfs_iodmin and nfs_iodmax, I'd simply have: - a fixed > > > > > number of > > > > > nfsiod threads (this could be a tunable, with the > > > > > understanding that > > > > > it should be large for good performance) > > > > > > > > > > > > > I do not see how this would solve the deadlock itself. The > > > > proposal would > > > > only allow system to survive slightly longer after the deadlock > > > > appeared. > > > > And, I think that allowing the unbound amount of nfsiod threads > > > > is also > > > > fatal. > > > > I should mention that what I was thinking of above was more than just getting rid of the snippet of code. It would have involved handing a buffer directly to an available nfsiod thread (no queuing on the mount point). That way there would never be a thread blocked waiting for a queued buffer. However, when I was thinking about it a little after posting, I came to a similar (but even simpler) conclusion than what you've proposed. (See below and attached patch.) > > > > The issue there is the LOR between buffer lock and vnode lock. > > > > Buffer lock > > > > always must come after the vnode lock. 
The problematic nfsiod
> > > > thread, which
> > > > locks the vnode, violates this rule, because despite the
> > > > LK_KERNPROC
> > > > ownership of the buffer lock, it is the thread which de facto
> > > > owns the
> > > > buffer (only the thread can unlock it).
> > > >
> > > > A possible solution would be to pass LK_NOWAIT to nfs_nget()
> > > > from the
> > > > nfs_readdirplusrpc(). From my reading of the code, nfs_nget()
> > > > should
> > > > be capable of correctly handling the lock failure. And EBUSY
> > > > would
> > > > result in doit = 0, which should be fine too.
> > > >
> > > > It is possible that EBUSY should be reset to 0, though.
> > >
> > > Yes, thinking about this more, I do think the right answer is for
> > > readdirplus to do this. The only question I have is if it should
> > > do
> > > this always, or if it should do this only from the nfsiod thread.
> > > I
> > > believe you can't get this in the non-nfsiod case.
> >
> > I agree that it looks as if the workaround is only needed for the nfsiod
> > thread.
> > On the other hand, it is not immediately obvious how to detect that
> > the current thread is an nfsiod daemon. Probably a thread flag should be
> > set.
>
> OTOH, updating the attributes from readdir+ is only an optimization
> anyway, so
> just having it always do LK_NOWAIT is probably ok (and simple).
> Currently I'm
> trying to develop a test case to provoke this so I can test the fix,
> but no
> luck on that yet.
>
Well, when I was thinking about it a bit after the last email, I was thinking "why bother having the nfsiod threads do readdirplus at all?". The only reason to use the nfsiod threads is read-ahead. However, for readdir this is problematic, since the read-ahead block can only be done when it has the correct directory offset cookie. This implies that it usually waits until the previous directory block has been read. In other words, the read-ahead can't usually start until the previous block has been read.
As such, why bother having the nfsiod threads do it? They might be better used for reads and writes. (OpenBSD doesn't do read-aheads for directories. In fact, they don't even keep directory blocks in the buffer cache, although I'm not sure I'd suggest the latter.) It would be nice to compare the performance with/without the attached patch. It might turn out that not using the nfsiod threads for readdir performs just as well or better? Anyhow, the attached trivial patch (which stops readdirplus from being done by the nfsiod threads) might be worth trying? However, I don't see a problem with modifying readdirplus to do a non-blocking vget(), except that it might make an already convoluted function even more convoluted.

rick

> --
> John Baldwin

------=_Part_3923481_469795532.1363300728964
Content-Type: text/x-patch; name=nfsiod.patch
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename=nfsiod.patch

--- fs/nfsclient/nfs_clbio.c.savy	2013-03-14 17:49:32.000000000 -0400
+++ fs/nfsclient/nfs_clbio.c	2013-03-14 18:02:53.000000000 -0400
@@ -1415,10 +1415,18 @@ ncl_asyncio(struct nfsmount *nmp, struct
 	 * Commits are usually short and sweet so lets save some cpu and
 	 * leave the async daemons for more important rpc's (such as reads
 	 * and writes).
+	 *
+	 * Readdirplus RPCs do vget()s to acquire the vnodes for entries
+	 * in the directory in order to update attributes. This can deadlock
+	 * with another thread that is waiting for async I/O to be done by
+	 * an nfsiod thread while holding a lock on one of these vnodes.
+	 * To avoid this deadlock, don't allow the async nfsiod threads to
+	 * perform Readdirplus RPCs.
 	 */
 	mtx_lock(&ncl_iod_mutex);
-	if (bp->b_iocmd == BIO_WRITE && (bp->b_flags & B_NEEDCOMMIT) &&
-	    (nmp->nm_bufqiods > ncl_numasync / 2)) {
+	if ((bp->b_iocmd == BIO_WRITE && (bp->b_flags & B_NEEDCOMMIT) &&
+	     (nmp->nm_bufqiods > ncl_numasync / 2)) ||
+	    (bp->b_vp->v_type == VDIR && (nmp->nm_flag & NFSMNT_RDIRPLUS))) {
 		mtx_unlock(&ncl_iod_mutex);
 		return(EIO);
 	}

------=_Part_3923481_469795532.1363300728964--

From owner-freebsd-fs@FreeBSD.ORG Thu Mar 14 23:23:22 2013
Return-Path:
Delivered-To: fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id C13988C9; Thu, 14 Mar 2013 23:23:22 +0000 (UTC) (envelope-from pawel@dawidek.net)
Received: from mail.dawidek.net (garage.dawidek.net [91.121.88.72]) by mx1.freebsd.org (Postfix) with ESMTP id 7EDF41C0; Thu, 14 Mar 2013 23:23:22 +0000 (UTC)
Received: from localhost (89-73-195-149.dynamic.chello.pl [89.73.195.149]) by mail.dawidek.net (Postfix) with ESMTPSA id 4BD9BB02; Fri, 15 Mar 2013 00:20:02 +0100 (CET)
Date: Fri, 15 Mar 2013 00:24:50 +0100
From: Pawel Jakub Dawidek
To: Bruce Evans
Subject: Re: patches to add new stat(2) file flags
Message-ID: <20130314232449.GC1446@garage.freebsd.pl>
References: <20130307000533.GA38950@nargothrond.kdm.org> <20130307214649.X981@besplex.bde.org>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="p2kqVDKq5asng8Dg"
Content-Disposition: inline
In-Reply-To: <20130307214649.X981@besplex.bde.org>
X-OS: FreeBSD 10.0-CURRENT amd64
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: arch@FreeBSD.org, "Kenneth D.
Merry" , fs@FreeBSD.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Thu, 14 Mar 2013 23:23:22 -0000

--p2kqVDKq5asng8Dg
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Mar 07, 2013 at 10:21:38PM +1100, Bruce Evans wrote:
> On Wed, 6 Mar 2013, Kenneth D. Merry wrote:
>
> > I have attached diffs against head for some additional stat(2) file flags.
> >
> > The primary purpose of these flags is to improve compatibility with CIFS,
> > both from the client and the server side.
> > ...
> > UF_IMMUTABLE: Command line name: "uchg", "uimmutable"
> > ZFS name: XAT_READONLY, ZFS_READONLY
> > Windows: FILE_ATTRIBUTE_READONLY
> >
> > This flag means that the file may not be modified.
> > This is not a new flag, but where applicable it is
> > mapped to the Windows readonly bit. ZFS and UFS
> > now both support the flag and enforce it.
> >
> > The behavior of this flag is compatible with MacOS X.
>
> This is incompatible with mapping the DOS read-only attribute to the
> non-writeable file permission in msdosfs. msdosfs does this mainly to
> get at least one useful file permission, but the semantics are subtly
> different from all of file permissions, UF_IMMUTABLE and SF_IMMUTABLE.
> I think it should be a new flag.

I agree, especially that I saw some discussion recently on Illumos mailing lists to not enforce this flag in ZFS, which would be confusing to FreeBSD users if we forget to _not_ merge that change.

--
Pawel Jakub Dawidek http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am!
http://tupytaj.pl --p2kqVDKq5asng8Dg-- From owner-freebsd-fs@FreeBSD.ORG Fri Mar 15 01:45:37 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 1393D7B0; Fri, 15 Mar 2013 01:45:37 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id 7B2968B3; Fri, 15 Mar 2013 01:45:36 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqAEALh8QlGDaFvO/2dsb2JhbABDiDK5dIJeggF0gisBAQUjBFIbDgoRGQIEVQYuh3mvHpJOjmIZGweCLYETA482hyKRAoMmIIFs X-IronPort-AV: E=Sophos;i="4.84,849,1355115600"; d="scan'208";a="21318975" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 14 Mar 2013 21:45:17 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 502FCB4026; Thu, 14 Mar 2013 21:45:17 -0400 (EDT) Date: Thu, 14 Mar 2013 21:45:17 -0400 (EDT) From: Rick Macklem To: John Baldwin Message-ID: <2115520715.3927772.1363311917302.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <201303141444.35740.jhb@freebsd.org> Subject: Re: Deadlock in the NFS client MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="----=_Part_3927771_888202910.1363311917299" X-Originating-IP: [172.17.91.202] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692) Cc: Rick Macklem , fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post:
List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Mar 2013 01:45:37 -0000 ------=_Part_3927771_888202910.1363311917299 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit John Baldwin wrote: > On Thursday, March 14, 2013 1:22:39 pm Konstantin Belousov wrote: > > On Thu, Mar 14, 2013 at 10:57:13AM -0400, John Baldwin wrote: > > > On Thursday, March 14, 2013 5:27:28 am Konstantin Belousov wrote: > > > > On Wed, Mar 13, 2013 at 07:33:35PM -0400, Rick Macklem wrote: > > > > > John Baldwin wrote: > > > > > > I ran into a machine that had a deadlock among certain files > > > > > > on a > > > > > > given NFS > > > > > > mount today. I'm not sure how best to resolve it, though it > > > > > > seems like > > > > > > perhaps there is a bug with how the pool of nfsiod threads > > > > > > is managed. > > > > > > Anyway, more details on the actual hang below. This was on > > > > > > 8.x with > > > > > > the > > > > > > old NFS client, but I don't see anything in HEAD that would > > > > > > fix this. > > > > > > > > > > > > First note that the system was idle so it had dropped down > > > > > > to only one > > > > > > nfsiod thread. > > > > > > > > > > > Hmm, I see the problem and I'm a bit surprised it doesn't bite > > > > > more often. > > > > > It seems to me that this snippet of code from nfs_asyncio() > > > > > makes too > > > > > weak an assumption: > > > > > /* > > > > > * If none are free, we may already have an iod working on > > > > > this mount > > > > > * point. If so, it will process our request. 
> > > > > */ > > > > > if (!gotiod) { > > > > > if (nmp->nm_bufqiods > 0) { > > > > > NFS_DPF(ASYNCIO, > > > > > ("nfs_asyncio: %d iods are already processing mount %p\n", > > > > > nmp->nm_bufqiods, nmp)); > > > > > gotiod = TRUE; > > > > > } > > > > > } > > > > > It assumes that, since an nfsiod thread is processing some > > > > > buffer for the > > > > > mount, it will become available to do this one, which isn't > > > > > true for your > > > > > deadlock. > > > > > > > > > > I think the simple fix would be to recode nfs_asyncio() so > > > > > that > > > > > it only returns 0 if it finds an AVAILABLE nfsiod thread that > > > > > it > > > > > has assigned to do the I/O, getting rid of the above. The > > > > > problem > > > > > with doing this is that it may result in a lot more > > > > > synchronous I/O > > > > > (nfs_asyncio() returns EIO, so the caller does the I/O). Maybe > > > > > more > > > > > synchronous I/O could be avoided by allowing nfs_asyncio() to > > > > > create a > > > > > new thread even if the total is above nfs_iodmax. (I think > > > > > this would > > > > > require the fixed array to be replaced with a linked list and > > > > > might > > > > > result in a large number of nfsiod threads.) Maybe just having > > > > > a large > > > > > nfs_iodmax would be an adequate compromise? > > > > > > > > > > Does having a large # of nfsiod threads cause any serious > > > > > problem for > > > > > most systems these days? > > > > > > > > > > I'd be tempted to recode nfs_asyncio() as above and then, > > > > > instead > > > > > of nfs_iodmin and nfs_iodmax, I'd simply have: - a fixed > > > > > number of > > > > > nfsiod threads (this could be a tunable, with the > > > > > understanding that > > > > > it should be large for good performance) > > > > > > > > > > > > > I do not see how this would solve the deadlock itself. The > > > > proposal would > > > > only allow system to survive slightly longer after the deadlock > > > > appeared. 
> > > > And, I think that allowing an unbounded number of nfsiod threads > > > > is also > > > > fatal. > > > > > > > > The issue there is the LOR between buffer lock and vnode lock. > > > > Buffer lock > > > > always must come after the vnode lock. The problematic nfsiod > > > > thread, which > > > > locks the vnode, violates this rule, because despite the > > > > LK_KERNPROC > > > > ownership of the buffer lock, it is the thread which de facto > > > > owns the > > > > buffer (only the thread can unlock it). > > > > > > > > A possible solution would be to pass LK_NOWAIT to nfs_nget() > > > > from the > > > > nfs_readdirplusrpc(). From my reading of the code, nfs_nget() > > > > should > > > > be capable of correctly handling the lock failure. And EBUSY > > > > would > > > > result in doit = 0, which should be fine too. > > > > > > > > It is possible that EBUSY should be reset to 0, though. > > > > > > Yes, thinking about this more, I do think the right answer is for > > > readdirplus to do this. The only question I have is if it should > > > do > > > this always, or if it should do this only from the nfsiod thread. > > > I > > > believe you can't get this in the non-nfsiod case. > > > > I agree that it looks as if the workaround is only needed for the nfsiod > > thread. > > On the other hand, it is not immediately obvious how to detect that > > the current thread is an nfsiod daemon. Probably a thread flag should be > > set. > > OTOH, updating the attributes from readdir+ is only an optimization > anyway, so > just having it always do LK_NOWAIT is probably ok (and simple). > Currently I'm > trying to develop a test case to provoke this so I can test the fix, > but no > luck on that yet. > > -- > John Baldwin When I commented out the readahead stuff for VDIR (patch attached), I got much better performance for a: # time ls -lR > /dev/null run at the root of the nfs mount point (-o nfsv3,rdirplus). However, I got about the same performance for the previous patch. 
(The difference is that this one doesn't play with the buffer cache for the read-ahead attempt.) My test environment is crappy (a laptop mounting itself), so it may just be a side effect of this. Maybe you could try this? rick ------=_Part_3927771_888202910.1363311917299 Content-Type: text/x-patch; name=nfsiod2.patch Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename=nfsiod2.patch LS0tIGZzL25mc2NsaWVudC9uZnNfY2xiaW8uYy5zYXZpdAkyMDEzLTAzLTE0IDIwOjQyOjUwLjAw MDAwMDAwMCAtMDQwMAorKysgZnMvbmZzY2xpZW50L25mc19jbGJpby5jCTIwMTMtMDMtMTQgMjA6 NDM6NDUuMDAwMDAwMDAwIC0wNDAwCkBAIC02NTgsNiArNjU4LDcgQEAgbmNsX2Jpb3JlYWQoc3Ry dWN0IHZub2RlICp2cCwgc3RydWN0IHVpbwogCQkgKiAoWW91IG5lZWQgdGhlIGN1cnJlbnQgYmxv Y2sgZmlyc3QsIHNvIHRoYXQgeW91IGhhdmUgdGhlCiAJCSAqICBkaXJlY3Rvcnkgb2Zmc2V0IGNv b2tpZSBvZiB0aGUgbmV4dCBibG9jay4pCiAJCSAqLworI2lmZGVmIG5vdGRlZgogCQlpZiAobm1w LT5ubV9yZWFkYWhlYWQgPiAwICYmCiAJCSAgICAoYnAtPmJfZmxhZ3MgJiBCX0lOVkFMKSA9PSAw ICYmCiAJCSAgICAobnAtPm5fZGlyZW9mb2Zmc2V0ID09IDAgfHwKQEAgLTY4MCw2ICs2ODEsNyBA QCBuY2xfYmlvcmVhZChzdHJ1Y3Qgdm5vZGUgKnZwLCBzdHJ1Y3QgdWlvCiAJCQkgICAgfQogCQkJ fQogCQl9CisjZW5kaWYKIAkJLyoKIAkJICogVW5saWtlIFZSRUcgZmlsZXMsIHdob3MgYnVmZmVy IHNpemUgKCBicC0+Yl9iY291bnQgKSBpcwogCQkgKiBjaG9wcGVkIGZvciB0aGUgRU9GIGNvbmRp dGlvbiwgd2UgY2Fubm90IHRlbGwgaG93IGxhcmdlCg== ------=_Part_3927771_888202910.1363311917299-- From owner-freebsd-fs@FreeBSD.ORG Fri Mar 15 09:47:48 2013 Return-Path: Delivered-To: fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 2849E385; Fri, 15 Mar 2013 09:47:48 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail28.syd.optusnet.com.au (mail28.syd.optusnet.com.au [211.29.133.169]) by mx1.freebsd.org (Postfix) with ESMTP id A3007DCA; Fri, 15 Mar 2013 09:47:46 +0000 (UTC) Received: from c211-30-173-106.carlnfd1.nsw.optusnet.com.au (c211-30-173-106.carlnfd1.nsw.optusnet.com.au [211.30.173.106]) by mail28.syd.optusnet.com.au 
(8.13.1/8.13.1) with ESMTP id r2F9lYsg010340 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 15 Mar 2013 20:47:38 +1100 Date: Fri, 15 Mar 2013 20:47:34 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Pawel Jakub Dawidek Subject: Re: patches to add new stat(2) file flags In-Reply-To: <20130314232449.GC1446@garage.freebsd.pl> Message-ID: <20130315184014.A902@besplex.bde.org> References: <20130307000533.GA38950@nargothrond.kdm.org> <20130307214649.X981@besplex.bde.org> <20130314232449.GC1446@garage.freebsd.pl> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.0 cv=JMpjKL2b c=1 sm=1 a=n2O7wv11oSwA:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=YOiZBDKP_E4A:10 a=LVUDrmMsRTOz-s-3SHEA:9 a=CjuIK1q_8ugA:10 a=TEtd8y5WR3g2ypngnwZWYw==:117 Cc: arch@FreeBSD.org, "Kenneth D. Merry" , fs@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Mar 2013 09:47:48 -0000 On Fri, 15 Mar 2013, Pawel Jakub Dawidek wrote: > On Thu, Mar 07, 2013 at 10:21:38PM +1100, Bruce Evans wrote: >> On Wed, 6 Mar 2013, Kenneth D. Merry wrote: >> >>> I have attached diffs against head for some additional stat(2) file flags. >>> >>> The primary purpose of these flags is to improve compatibility with CIFS, >>> both from the client and the server side. >>> ... >>> UF_IMMUTABLE: Command line name: "uchg", "uimmutable" >>> ZFS name: XAT_READONLY, ZFS_READONLY >>> Windows: FILE_ATTRIBUTE_READONLY >>> >>> This flag means that the file may not be modified. >>> This is not a new flag, but where applicable it is >>> mapped to the Windows readonly bit. ZFS and UFS >>> now both support the flag and enforce it. >>> >>> The behavior of this flag is compatible with MacOS X. 
>> >> This is incompatible with mapping the DOS read-only attribute to the >> non-writeable file permission in msdosfs. msdosfs does this mainly to >> get at least one useful file permission, but the semantics are subtly >> different from all of file permissions, UF_IMMUTABLE and SF_IMMUTABLE. >> I think it should be a new flag. > > I agree, especially that I saw some discussion recently on Illumos > mailing lists to not enforce this flag in ZFS, which would be confusing > to FreeBSD users if we forget to _not_ merge that change. However, I now think the READONLY attribute would map well to UF_IMMUTABLE in msdosfs, better than the current mapping of the READONLY attribute to the inverse of the write permissions bits. The permissions bits are also controlled by the permissions bits of the mount point, and this is the least worst way to control them for general files. When this is mixed with control by the READONLY attribute (which involves back-control of the READONLY attribute according to the permissions bits), the behaviour is confusing and might lead to the READONLY bit being set for too many files (e.g., for copies of man pages, since man pages are installed with the bogus permissions r--r--r-- although the owner (root) can write them (the r--r--r-- permissions only made sense when the owner was bin)). If the READONLY attribute is instead mapped only to UF_IMMUTABLE, its impact would be smaller since there aren't so many files which have a native READONLY attribute or a native UF_IMMUTABLE attribute. The READONLY attribute would interact badly with the permissions bits in a different way -- just like UF_IMMUTABLE interacts with them. It is confusing when ls -l shows writability for non-writable files. Further testing of possible confusion from UF_IMMUTABLE on a rw-r--r-- uchg file on ffs showed that: - eaccess(2) with flag W_OK used to work correctly, although this was not documented. 
It used to return the documented errno EACCES, but its man page didn't say anything about immutable attributes and said that this error means that the permissions bits indicate no access (or search permission is denied). - eaccess(2) with flag W_OK now returns the undocumented errno EPERM. Its man page doesn't seem to have changed significantly. Documentation for ACLs also seems to be missing. The old and new man pages point to more details in intro(2). The fine details are missing there too. There is just the usual weaselish "appropriate privilege" used in a generic way for EPERM. This can mean anything, but what it means is not documented in either man page. Actually, eaccess() used to work correctly because I fixed it locally. It seems to have always been broken in FreeBSD. The current version is: @ /* @ * If immutable bit set, nobody gets to write it. "& ~VADMIN_PERMS" @ * is here, because without it, * it would be impossible for the owner @ * to remove the IMMUTABLE flag. @ */ @ if ((accmode & (VMODIFY_PERMS & ~VADMIN_PERMS)) && @ (ip->i_flags & (IMMUTABLE | SF_SNAPSHOT))) @ return (EPERM); Bugs to be fixed here: - the first sentence in the comment is banal and doesn't even echo the code (the code actually handles several immutable bits (obfuscated by the IMMUTABLE macro), and also the snapshot bit) - the second sentence in the comment has a misplaced comment delimiter '*' in the middle of it. It also doesn't fully echo the code, but is not banal. - the "write" in the first sentence also doesn't even echo the code. It used to echo the code when the code was simpler. The code used to check only (accmode & VWRITE). But immutability prevents much more than writing, and the code now handles that. - wrong errno. ext2fs still uses the old ffs code here (except it doesn't use IMMUTABLE and checks explicitly for the only immutable flag that it supports). It duplicates the SF_SNAPSHOT check, but that is nonsense because ext2fs doesn't support snapshots. 
nandfs copies ffs for setattr, so it has immutability flag checks there, but it just uses vaccess() for access(), so it is missing the above, so the immutable flag checks are either nonsense where they are made or missing here. tmpfs uses the old ffs code here (except for mangling the style, but it does remove the banal comment). I couldn't see exactly what zfs does here, but it mostly returns EPERM for immutable flags checks. Fixing foofs_access() hopefully also fixes open(2), unlink(2), ... Unfortunately, my fix is incompatible with dubious fixes that make the man pages bug-for-bug compatible with the code. POSIX of course doesn't document EPERM for open(2) (except in the general weasel section about appropriate privilege). FreeBSD didn't document it either in the version in which the above was fixed. But now FreeBSD documents in open.2 and other man pages that immutability gives EPERM, and the code always had this bug. The changes in the man pages have some style bugs: in open.2: - a comma splice in the reference to chflags(2) - this reference is only made in 1 of the descriptions of EPERM. These style bugs were cloned to most or all man pages that are affected by immutability or nounlink flags. ACLs still seem to be unmentioned in all these man pages. I don't use them, so I don't know what happens for them. However, the core vfs function vaccess() is careful to always return EACCES and EPERM as explicitly specified by POSIX. This means EACCES for all cases except VADMIN. VADMIN/EPERM apply to chmod(), chown(), ... but shouldn't apply to open(), unlink(), rename(), ... 
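[Illustration, not part of the original message.] The errno split argued for above can be modelled in a few lines of userland C. This is only a sketch under stated assumptions: the X_* constants and model_access() are hypothetical stand-ins, not the kernel's VWRITE/VADMIN/IMMUTABLE definitions and not the actual ufs or vaccess() code. It shows a denied modification of an immutable file reporting EACCES, with EPERM kept for denied administrative requests, and administrative access left open so the owner can still clear the flag.

```c
#include <errno.h>

/* Hypothetical stand-ins for kernel access-mode and file-flag bits. */
#define X_VWRITE    0x01  /* request to modify the file */
#define X_VADMIN    0x02  /* request to change mode, owner or flags */
#define X_IMMUTABLE 0x10  /* file carries an immutable flag */

/*
 * Model only: returns 0 when access is allowed, otherwise an errno value.
 * A modify request on an immutable file gets EACCES; EPERM is reserved
 * for administrative requests made by someone other than the owner.
 * Administrative access is not blocked by the immutable flag, so the
 * owner can still remove it.
 */
static int
model_access(int accmode, int fileflags, int is_owner)
{
	if ((accmode & X_VWRITE) != 0 && (fileflags & X_IMMUTABLE) != 0)
		return (EACCES);
	if ((accmode & X_VADMIN) != 0 && !is_owner)
		return (EPERM);
	return (0);
}
```

With this model, model_access(X_VWRITE, X_IMMUTABLE, 1) yields EACCES, while model_access(X_VADMIN, X_IMMUTABLE, 1) yields 0 because the owner must remain able to clear the flag.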
Bruce From owner-freebsd-fs@FreeBSD.ORG Fri Mar 15 11:34:39 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 76027EF6 for ; Fri, 15 Mar 2013 11:34:39 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id B1B23892 for ; Fri, 15 Mar 2013 11:34:38 +0000 (UTC) Received: from odyssey.starpoint.kiev.ua (alpha-e.starpoint.kiev.ua [212.40.38.101]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id NAA25999; Fri, 15 Mar 2013 13:34:29 +0200 (EET) (envelope-from avg@FreeBSD.org) Message-ID: <51430744.6020004@FreeBSD.org> Date: Fri, 15 Mar 2013 13:34:28 +0200 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130313 Thunderbird/17.0.4 MIME-Version: 1.0 To: Freddie Cash Subject: Re: Strange slowdown when cache devices enabled in ZFS References: In-Reply-To: X-Enigmail-Version: 1.5.1 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Mar 2013 11:34:39 -0000 on 14/03/2013 20:13 Freddie Cash said the following: > the l2arc_feed_thread of zfskern will spin until it takes up 100% > of a CPU core If you see a thread taking 100% where it shouldn't, then just profile it and actually see what it's doing. 
-- Andriy Gapon From owner-freebsd-fs@FreeBSD.ORG Fri Mar 15 13:12:57 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id D7DDDC9C for ; Fri, 15 Mar 2013 13:12:57 +0000 (UTC) (envelope-from girgen@FreeBSD.org) Received: from melon.pingpong.net (melon.pingpong.net [79.136.116.200]) by mx1.freebsd.org (Postfix) with ESMTP id 77AE8DFC for ; Fri, 15 Mar 2013 13:12:57 +0000 (UTC) Received: from girgBook.local (citron2.pingpong.net [195.178.173.68]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by melon.pingpong.net (Postfix) with ESMTPSA id 7A20F14D05; Fri, 15 Mar 2013 14:12:49 +0100 (CET) Message-ID: <51431E50.2020109@FreeBSD.org> Date: Fri, 15 Mar 2013 14:12:48 +0100 From: Palle Girgensohn User-Agent: Postbox 3.0.7 (Macintosh/20130119) MIME-Version: 1.0 To: Kirk McKusick Subject: Re: leaking lots of unreferenced inodes (pg_xlog files?), maybe after moving tables and indexes to tablespace on different volume References: <201303131652.r2DGqSr4051899@chez.mckusick.com> In-Reply-To: <201303131652.r2DGqSr4051899@chez.mckusick.com> X-Enigmail-Version: 1.2.3 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Cc: freebsd-fs@freebsd.org, Jeff Roberson X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Mar 2013 13:12:57 -0000 -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi Kirk, Thanks for your reply! Kirk McKusick skrev: > Thanks for your report. It is certainly unlike anything that we have > seen reported before. > > Are you running your /usr filesystem with (the default) journalled > soft updates? You can check this by running the `mount' command with > no arguments. 
yes, plain vanilla: /dev/da0s1f on /usr (ufs, local, soft-updates) > > Rather than rebooting your system, it would be most helpful if you > could instead shut it down to single user. Then do the following: OK, we have planned downtime this evening. I will run the suggested commands in a script session before rebooting. Best regards, Palle > > Create a transcript of your session by running `script'. Once running > in the session run these commands: > > Run `mount' to show your filesystem configuration. Run `df -hi /usr' > to see whether the inodes are still missing. Verify that you can > cleanly unmount /usr (e.g., that the unmount does not hang and does > not complain). Remount /usr and run `df -hi' to see whether the > inodes are still missing. Unmount /usr again and run `fsck_ffs -p -f > -d /usr'. If the fsck_ffs fails with an unexpected inconsistency, you > can run `fsck_ffs -y -d /usr' to force it to clean up. When you have > the filesystem successfully cleaned up, type `exit' to get out of the > script session and mail me the transcript of the session > (typescript). > > Thanks for your help in tracking this down. > > Kirk McKusick > > ----- Original Message: > > Date: Wed, 13 Mar 2013 11:23:13 +0100 From: Palle Girgensohn > To: freebsd-fs@freebsd.org Subject: leaking lots > of unreferenced inodes (pg_xlog files?), maybe after moving tables > and indexes to tablespace on different volume > > Hi! > > Running postgresql-9.2.2 on FreeBSD 9.1 amd64 using vanilla ufs file > system. > > I have the postgresql base/ on the /usr disk, and a separate volume > /opt where the default tablespace resides. This means that the amount > of data on the /usr disk should be stable. This is not the case, the > disk usage grows linearly (it seems to leave many inodes > unreferenced). 
> > The discrepancy between df and du is now huge: > > # du -sxh /usr; df -h /usr 4,6G /usr Filesystem Size Used > Avail Capacity Mounted on /dev/da0s1f 104G 88G 8.0G 92% > /usr > > 4,6G vs 88GB, that must be more than a rounding error? > > Strange thing is I cannot find any open files among the missing. > > # lsof /usr| awk '{print $9}'|xargs ls -l > /dev/null > > returns no errors (a missing file would render an error with ls). If > there were open files not referenced in any directory, they should > be found. > > Next thing is fsck, and yes, there are plenty of unreferenced files. > > I ran fsck while system is running (i.e. read only) to get a grip > of the amount of lost inodes: > > fsck /usr | awk '{print $1}'|cut -f 2 -d=| perl -e '$i = 0; while > (<>) { $i += $_;}; print $i / 1024 / 1024; print "\n";' > 85223.3530330658 > > ~85 GB gone, that's 80% of the disk, and it accounts for all the > missing space. > > MTIME for the inodes are pretty evenly spread over time since the > machine was updated to FreeBSD 9.1, rebooted, and PostgreSQL was > updated to 9.2. All was done at the same time, so I can't really tell > who's to blame, but this is the only server, out of a dozen that > were updated to exactly the same versions, that has this problem. > All other servers have their /usr disk usage stable (since all data > resides on a separate tablespace). > > The unreferenced inodes are almost exclusively around 16 MB in size, > so they most certainly all are postgresql pg_xlog files. This means > all files are lost from the same portion of code in the database > engine. > > How could it possibly be able to leave unreferenced inodes around > like this at such a scale? Is the culprit a combination of postgresql > and file system code? Both were updated. 
> > pg_xlog checkpoints seems to happen approximately every three > minutes: > > Mar 13 00:39:08 dbserver postgres[5298]: [48-1] db=,user= LOG: > checkpoint starting: time Mar 13 00:41:38 dbserver postgres[5298]: > [49-1] db=,user= LOG: checkpoint complete: wrote 2542 buffers (0.3%); > 0 transaction log file(s) added, 0 removed, 1 recycled; write=149.667 > s, sync=0.101 s, total=149.770 s; sync files=628, longest=0.021 s, > average=0.000 s Mar 13 00:44:08 dbserver postgres[5298]: [50-1] > db=,user= LOG: checkpoint starting: time Mar 13 00:46:38 dbserver > postgres[5298]: [51-1] db=,user= LOG: checkpoint complete: wrote 3996 > buffers (0.4%); 0 transaction log file(s) added, 0 removed, 1 > recycled; write=149.438 s, sync=0.111 s, total=149.551 s; sync > files=823, longest=0.006 s, average=0.000 s Mar 13 00:49:08 dbserver > postgres[5298]: [52-1] db=,user= LOG: checkpoint starting: time Mar > 13 00:51:38 dbserver postgres[5298]: [53-1] db=,user= LOG: checkpoint > complete: wrote 13736 buffers (1.4%); 0 transaction log file(s) > added, 0 removed, 2 recycled; write=149.958 s, sync=0.311 s, > total=150.271 s; sync files=1335, longest=0.079 s, average=0.000 s > Mar 13 00:54:08 dbserver postgres[5298]: [54-1] db=,user= LOG: > checkpoint starting: time Mar 13 00:56:38 dbserver postgres[5298]: > [55-1] db=,user= LOG: checkpoint complete: wrote 14638 buffers > (1.5%); 0 transaction log file(s) added, 0 removed, 17 recycled; > write=149.330 s, sync=0.271 s, total=149.603 s; sync files=1363, > longest=0.017 s, average=0.000 s Mar 13 00:59:08 dbserver > postgres[5298]: [56-1] db=,user= LOG: checkpoint starting: time Mar > 13 01:01:38 dbserver postgres[5298]: [57-1] db=,user= LOG: checkpoint > complete: wrote 8035 buffers (0.8%); 0 transaction log file(s) added, > 0 removed, 21 recycled; write=149.285 s, sync=0.146 s, total=149.433 > s; sync files=1160, longest=0.003 s, average=0.000 s Mar 13 01:04:08 > dbserver postgres[5298]: [58-1] db=,user= LOG: checkpoint starting: > time 
Mar 13 01:06:37 dbserver postgres[5298]: [59-1] db=,user= LOG: > checkpoint complete: wrote 2156 buffers (0.2%); 0 transaction log > file(s) added, 0 removed, 9 recycled; write=149.402 s, sync=0.057 s, > total=149.461 s; sync files=610, longest=0.000 s, average=0.000 s Mar > 13 01:09:08 dbserver postgres[5298]: [60-1] db=,user= LOG: checkpoint > starting: time > > > I'm pretty certain that unmounting the file system and running fsck > will regain the lost space, but will it stop there? > > Stopping postgresql briefly did not help, I tried that. That would > have helped if the files were open, but they're not. It seems > postgresql did the right thing, and FreeBSD failed to unreference the > files. > > The server has about 30 databases and ~127 concurrent connections > (not all being active simultaneously, though), so it is fair to say > it is pretty active, but nothing extreme. > > Hardware is HP DL360, using their HT Smart Array P410i. > > Any ideas how to debug this? Or shall I just reboot, fsck, hope the > problem will go away, and when it does, forget about it? 
> > Thanks, Palle -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 > v2.0.17 (Darwin) Comment: GPGTools - http://gpgtools.org Comment: > Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iQEcBAEBAgAGBQJRQFORAAoJEIhV+7FrxBJDzVUIAJHU011JDxLxj8/xg05Gwhgq > XK3xB+0N0NSUQ50yhcRKLINz/j/XfeS0ZxlH+MstaPA9y0r1JUXMxkb/uTUvGBiy > jutk3eVe0cati9cVZbJkRU5FxEgmQ0fg0GOMl3RQAErkh5achj+klWvN7PnwGjTs > O3L9RgckKuxTJffk52GAS05qY/TKR6f08kdX3I2cFtqw3tyTyrXU0JPdk2snuPhv > H40xV46zgtWMFDvZLt61MryQ7/JotVQwU78scUB+zxrf8KKM9V0mM7pk0pIbG4Qw > NJBpZJ5gjbl4x+dkQrtZdL65yq88hACYwo9D+83Ct4ig8tgcQ7ViNHWxJqknK7Q= > =3ZZs -----END PGP SIGNATURE----- > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs To unsubscribe, > send any mail to "freebsd-fs-unsubscribe@freebsd.org" -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.17 (Darwin) Comment: GPGTools - http://gpgtools.org Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQEcBAEBAgAGBQJRQx5QAAoJEIhV+7FrxBJDAcYIALvj4hiWoAN/wchrJIbfiXbY XcPNIuqKFT1sRYWgdLZQ7e34zmvtmPfa0WW6/OFHbI5q+G/xciuZLhTl7EZ98IvD jbBUR4SLLcrOvFNe35b43eOqr12okIboLg2fQx/jUbWQM19V/2/YaLobBDl2iv/v gbD5ErL3yd0YBU1EFETho3hsL9fzbmSczQqhWWs0glD+aiHDQbtIAFVkC3IZSaLl MNhqrzKsv4kEHXSylYRU2RbHYKNg55jQ1JHA5HKinqZbe7qLmyqr4dFVtbYEgGE9 DCh4/buO0/UIg+Te7WuD2XxMhfutgbGN6kOaTXk3NQhtgd5a/8I/yqfv6zGtglY= =UFMm -----END PGP SIGNATURE----- From owner-freebsd-fs@FreeBSD.ORG Fri Mar 15 13:21:27 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 79FCAE1E for ; Fri, 15 Mar 2013 13:21:27 +0000 (UTC) (envelope-from phantom@phantom.su) Received: from relay13.nicmail.ru (relay13.nicmail.ru [195.208.6.7]) by mx1.freebsd.org (Postfix) with ESMTP id 0F0A8E72 for ; Fri, 15 Mar 2013 13:21:26 +0000 (UTC) Received: from [109.70.25.39] (port=37155 helo=nicmail.ru) by f17.mail.nic.ru with 
esmtp (Exim 5.55) (envelope-from ) id 1UGUZT-0006Qg-3g; Fri, 15 Mar 2013 17:21:19 +0400 Received: from [194.85.198.26] (account phantom@phantom.su HELO phantom-mobile.node) by fcgp04.nicmail.ru (CommuniGate Pro SMTP 5.2.3) with ESMTPSA id 326736398; Fri, 15 Mar 2013 17:21:19 +0400 Message-ID: <5143204E.90003@phantom.su> Date: Fri, 15 Mar 2013 17:21:18 +0400 From: Ilia Noskov User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130215 Thunderbird/17.0.3 MIME-Version: 1.0 To: kostikbel@gmail.com Subject: Re: should vn_fullpath1() ever return a path with "." in it? References: <1208475167.3432384.1362099531469.JavaMail.root@erie.cs.uoguelph.ca> <51417C47.8010304@phantom.su> <20130314090847.GH3794@kib.kiev.ua> <5141A212.9050909@phantom.su> In-Reply-To: <5141A212.9050909@phantom.su> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: phantom@phantom.su List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Mar 2013 13:21:27 -0000 On 03/14/2013 02:10 PM, Ilia Noskov wrote: > On 03/14/2013 01:08 PM, Konstantin Belousov wrote: >> On Thu, Mar 14, 2013 at 11:29:11AM +0400, Noskov Ilia wrote: >>> Strange behavior on nfs-client after apply this patch: >>> >>> sysctl debug.disablecwd=0 >>> sysctl debug.disablefullpath=0 >>> >>> # mount -v -t nfs >>> 192.168.168.1:/pool on /home (nfs, noatime, nfsv4acls, fsid >>> 02ff003a3a000000) >>> # ls /home | wc -l >>> 4946 >>> # cd /home/user6308/.ro >>> # time pwd >>> /home/user6308/.ro >>> 0.008u 0.269s 0:08.47 3.0% 4+157k 0+0io 0pf+0w >>> # ktrace -t+ -i pwd >>> >>> >>> ktrace.out is big (1MB). Attach or not? 
>>> >>> >>> >>> A small piece of trace: >>> 19527 pwd CALL >>> mmap(0,0x400000,0x3,0x1002,0xffffffff,0) >>> >>> 19527 pwd RET mmap 34376515584/0x801000000 >>> 19527 pwd CALL __getcwd(0x801006400,0x400) >>> 19527 pwd NAMI ".." >>> 19527 pwd NAMI ".." >>> 19527 pwd RET __getcwd -1 errno 2 No such file or directory >>> 19527 pwd CALL stat(0x800947a14,0x7fffffffd940) >>> 19527 pwd NAMI "/" >>> 19527 pwd STRU struct stat {dev=98, ino=2, mode=drwxr-xr-x , >>> nlink=19, uid=0, gid=0, rdev=2120, atime=1363244893, stime=1362653279, >>> ctime=1362653279, birthtime=1200836451, size=1024, blksize=16384, >>> blocks=4, flags=0x0 } >>> 19527 pwd RET stat 0 >>> 19527 pwd CALL lstat(0x80094779c,0x7fffffffd940) >>> 19527 pwd NAMI "." >>> 19527 pwd STRU struct stat {dev=1230702064, ino=145, >>> mode=drwxr-xr-x , nlink=2, uid=0, gid=0, rdev=4294967295, >>> atime=1363244672.246785874, stime=1363244792.864201338, >>> ctime=1363244792.864201338, birthtime=-1, size=3, blksize=4096, >>> blocks=3, flags=0x0 } >>> 19527 pwd RET lstat 0 >>> 19527 pwd CALL openat(0xffffff9c,0x80094779b,0x100000,0x2) >>> 19527 pwd NAMI ".." 
>>> 19527 pwd RET openat 3 >>> 19527 pwd CALL fstat(0x3,0x7fffffffd880) >>> 19527 pwd STRU struct stat {dev=1230702064, ino=4, >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, >>> atime=1363244665.232140704, stime=1363010116.496298252, >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, >>> blocks=3, flags=0x0 } >>> 19527 pwd RET fstat 0 >>> 19527 pwd CALL fcntl(0x3,F_SETFD,FD_CLOEXEC) >>> 19527 pwd RET fcntl 0 >>> 19527 pwd CALL fstatfs(0x3,0x7fffffffd660) >>> 19527 pwd RET fstatfs 0 >>> 19527 pwd CALL fstat(0x3,0x7fffffffd940) >>> 19527 pwd STRU struct stat {dev=1230702064, ino=4, >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, >>> atime=1363244665.232140704, stime=1363010116.496298252, >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, >>> blocks=3, flags=0x0 } >>> 19527 pwd RET fstat 0 >>> 19527 pwd CALL >>> getdirentries(0x3,0x801018000,0x1000,0x8010160a8) >>> 19527 pwd RET getdirentries 4096/0x1000 >>> 19527 pwd CALL fstat(0x3,0x7fffffffd940) >>> 19527 pwd STRU struct stat {dev=1230702064, ino=4, >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, >>> atime=1363244665.232140704, stime=1363010116.496298252, >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, >>> blocks=3, flags=0x0 } >>> 19527 pwd RET fstat 0 >>> 19527 pwd CALL openat(0x3,0x80094779b,0x100000,0) >>> 19527 pwd NAMI ".." >>> 19527 pwd RET openat 4 >>> [..............................] 
>>> 19527 pwd CALL madvise(0x801016000,0x1000,MADV_FREE) >>> 19527 pwd RET madvise 0 >>> 19527 pwd CALL madvise(0x801018000,0x2000,MADV_FREE) >>> 19527 pwd RET madvise 0 >>> 19527 pwd CALL close(0x3) >>> 19527 pwd RET close 0 >>> 19527 pwd CALL fstat(0x4,0x7fffffffd880) >>> 19527 pwd STRU struct stat {dev=973143810, ino=4, >>> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295, >>> atime=1363244767.460164771, stime=1363172100.380266923, >>> ctime=1363172100.380266923, birthtime=-1, size=4948, blksize=4096, >>> blocks=713, flags=0x0 } >>> 19527 pwd RET fstat 0 >>> 19527 pwd CALL fcntl(0x4,F_SETFD,FD_CLOEXEC) >>> 19527 pwd RET fcntl 0 >>> 19527 pwd CALL fstatfs(0x4,0x7fffffffd660) >>> 19527 pwd RET fstatfs 0 >>> 19527 pwd CALL fstat(0x4,0x7fffffffd940) >>> 19527 pwd STRU struct stat {dev=973143810, ino=4, >>> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295, >>> atime=1363244767.460164771, stime=1363172100.380266923, >>> ctime=1363172100.380266923, birthtime=-1, size=4948, blksize=4096, >>> blocks=713, flags=0x0 } >>> 19527 pwd RET fstat 0 >>> 19527 pwd CALL >>> getdirentries(0x4,0x801018000,0x1000,0x8010160a8) >>> 19527 pwd RET getdirentries 4096/0x1000 >>> 19527 pwd CALL fstatat(0x4,0x801018030,0x7fffffffd940,0x200) >>> 19527 pwd NAMI "user6158" >>> 19527 pwd STRU struct stat {dev=1774902232, ino=4, >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, >>> atime=1363009687.040357529, stime=1363010116.496298252, >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, >>> blocks=3, flags=0x0 } >>> 19527 pwd RET fstatat 0 >>> 19527 pwd CALL fstatat(0x4,0x80101804c,0x7fffffffd940,0x200) >>> 19527 pwd NAMI "user2289" >>> 19527 pwd STRU struct stat {dev=1988229825, ino=4, >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, >>> atime=1363009687.040357529, stime=1363010116.496298252, >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, >>> blocks=3, flags=0x0 } >>> 19527 pwd RET fstatat 0 >>> 19527 pwd CALL 
fstatat(0x4,0x801018068,0x7fffffffd940,0x200) >>> 19527 pwd NAMI "user4761" >>> 19527 pwd STRU struct stat {dev=2438657130, ino=4, >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, >>> atime=1363009687.040357529, stime=1363010116.496298252, >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, >>> blocks=3, flags=0x0 } >>> 19527 pwd RET fstatat 0 >>> 19527 pwd CALL fstatat(0x4,0x801018084,0x7fffffffd940,0x200) >>> 19527 pwd NAMI "user6055" >>> [.........................................] >>> >>> and next get stat of all directories in /home >> >> Slightly different version of the patch was committed as r247560. >> >> The situation could only happen if the parent directory contains the "." >> entry with inode number equal to the inode number of the subdirectory. >> Can you confirm that this is your case ? >> > > Yes, it is. > I'll try again on the latest snapshot. Thanks! > Yes. On latest r248313 similar situation - if path contains "." then nfsclient get stat of all directories in /home. 
-- Best Regards, Ilia Noskov Regional Network Information Center (RU-CENTER) phone: +7 495 737-0601 fax: +7 495 737-0602 http://www.nic.ru From owner-freebsd-fs@FreeBSD.ORG Fri Mar 15 13:59:56 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 168FE199 for ; Fri, 15 Mar 2013 13:59:56 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 4E8B6258 for ; Fri, 15 Mar 2013 13:59:55 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.6/8.14.6) with ESMTP id r2FDxo98092642; Fri, 15 Mar 2013 15:59:50 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.8.0 kib.kiev.ua r2FDxo98092642 Received: (from kostik@localhost) by tom.home (8.14.6/8.14.6/Submit) id r2FDxo9B092641; Fri, 15 Mar 2013 15:59:50 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Fri, 15 Mar 2013 15:59:50 +0200 From: Konstantin Belousov To: Ilia Noskov Subject: Re: should vn_fullpath1() ever return a path with "." in it? 
Message-ID: <20130315135950.GU3794@kib.kiev.ua> References: <1208475167.3432384.1362099531469.JavaMail.root@erie.cs.uoguelph.ca> <51417C47.8010304@phantom.su> <20130314090847.GH3794@kib.kiev.ua> <5141A212.9050909@phantom.su> <5143204E.90003@phantom.su> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="IcaJnnNV5xAPxpBT" Content-Disposition: inline In-Reply-To: <5143204E.90003@phantom.su> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Mar 2013 13:59:56 -0000 --IcaJnnNV5xAPxpBT Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Fri, Mar 15, 2013 at 05:21:18PM +0400, Ilia Noskov wrote: > On 03/14/2013 02:10 PM, Ilia Noskov wrote: > > On 03/14/2013 01:08 PM, Konstantin Belousov wrote: > >> On Thu, Mar 14, 2013 at 11:29:11AM +0400, Noskov Ilia wrote: > >>> Strange behavior on nfs-client after apply this patch: > >>> > >>> sysctl debug.disablecwd=0 > >>> sysctl debug.disablefullpath=0 > >>> > >>> # mount -v -t nfs > >>> 192.168.168.1:/pool on /home (nfs, noatime, nfsv4acls, fsid > >>> 02ff003a3a000000) > >>> # ls /home | wc -l > >>> 4946 > >>> # cd /home/user6308/.ro > >>> # time pwd > >>> /home/user6308/.ro > >>> 0.008u 0.269s 0:08.47 3.0% 4+157k 0+0io 0pf+0w > >>> # ktrace -t+ -i pwd > >>> > >>> > >>> ktrace.out is big (1MB). Attach or not?
> >>> > >>> > >>> > >>> A small piece of trace: > >>> 19527 pwd CALL > >>> mmap(0,0x400000,0x3,0x1002,0xffffffff,0) > >>> > >>> 19527 pwd RET mmap 34376515584/0x801000000 > >>> 19527 pwd CALL __getcwd(0x801006400,0x400) > >>> 19527 pwd NAMI ".." > >>> 19527 pwd NAMI ".." > >>> 19527 pwd RET __getcwd -1 errno 2 No such file or directory > >>> 19527 pwd CALL stat(0x800947a14,0x7fffffffd940) > >>> 19527 pwd NAMI "/" > >>> 19527 pwd STRU struct stat {dev=98, ino=2, mode=drwxr-xr-x , > >>> nlink=19, uid=0, gid=0, rdev=2120, atime=1363244893, stime=1362653279, > >>> ctime=1362653279, birthtime=1200836451, size=1024, blksize=16384, > >>> blocks=4, flags=0x0 } > >>> 19527 pwd RET stat 0 > >>> 19527 pwd CALL lstat(0x80094779c,0x7fffffffd940) > >>> 19527 pwd NAMI "." > >>> 19527 pwd STRU struct stat {dev=1230702064, ino=145, > >>> mode=drwxr-xr-x , nlink=2, uid=0, gid=0, rdev=4294967295, > >>> atime=1363244672.246785874, stime=1363244792.864201338, > >>> ctime=1363244792.864201338, birthtime=-1, size=3, blksize=4096, > >>> blocks=3, flags=0x0 } > >>> 19527 pwd RET lstat 0 > >>> 19527 pwd CALL openat(0xffffff9c,0x80094779b,0x100000,0x2) > >>> 19527 pwd NAMI ".."
> >>> 19527 pwd RET openat 3 > >>> 19527 pwd CALL fstat(0x3,0x7fffffffd880) > >>> 19527 pwd STRU struct stat {dev=1230702064, ino=4, > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, > >>> atime=1363244665.232140704, stime=1363010116.496298252, > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, > >>> blocks=3, flags=0x0 } > >>> 19527 pwd RET fstat 0 > >>> 19527 pwd CALL fcntl(0x3,F_SETFD,FD_CLOEXEC) > >>> 19527 pwd RET fcntl 0 > >>> 19527 pwd CALL fstatfs(0x3,0x7fffffffd660) > >>> 19527 pwd RET fstatfs 0 > >>> 19527 pwd CALL fstat(0x3,0x7fffffffd940) > >>> 19527 pwd STRU struct stat {dev=1230702064, ino=4, > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, > >>> atime=1363244665.232140704, stime=1363010116.496298252, > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, > >>> blocks=3, flags=0x0 } > >>> 19527 pwd RET fstat 0 > >>> 19527 pwd CALL > >>> getdirentries(0x3,0x801018000,0x1000,0x8010160a8) > >>> 19527 pwd RET getdirentries 4096/0x1000 > >>> 19527 pwd CALL fstat(0x3,0x7fffffffd940) > >>> 19527 pwd STRU struct stat {dev=1230702064, ino=4, > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, > >>> atime=1363244665.232140704, stime=1363010116.496298252, > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, > >>> blocks=3, flags=0x0 } > >>> 19527 pwd RET fstat 0 > >>> 19527 pwd CALL openat(0x3,0x80094779b,0x100000,0) > >>> 19527 pwd NAMI ".." > >>> 19527 pwd RET openat 4 > >>> [..............................]
> >>> 19527 pwd CALL madvise(0x801016000,0x1000,MADV_FREE) > >>> 19527 pwd RET madvise 0 > >>> 19527 pwd CALL madvise(0x801018000,0x2000,MADV_FREE) > >>> 19527 pwd RET madvise 0 > >>> 19527 pwd CALL close(0x3) > >>> 19527 pwd RET close 0 > >>> 19527 pwd CALL fstat(0x4,0x7fffffffd880) > >>> 19527 pwd STRU struct stat {dev=973143810, ino=4, > >>> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295, > >>> atime=1363244767.460164771, stime=1363172100.380266923, > >>> ctime=1363172100.380266923, birthtime=-1, size=4948, blksize=4096, > >>> blocks=713, flags=0x0 } > >>> 19527 pwd RET fstat 0 > >>> 19527 pwd CALL fcntl(0x4,F_SETFD,FD_CLOEXEC) > >>> 19527 pwd RET fcntl 0 > >>> 19527 pwd CALL fstatfs(0x4,0x7fffffffd660) > >>> 19527 pwd RET fstatfs 0 > >>> 19527 pwd CALL fstat(0x4,0x7fffffffd940) > >>> 19527 pwd STRU struct stat {dev=973143810, ino=4, > >>> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295, > >>> atime=1363244767.460164771, stime=1363172100.380266923, > >>> ctime=1363172100.380266923, birthtime=-1, size=4948, blksize=4096, > >>> blocks=713, flags=0x0 } > >>> 19527 pwd RET fstat 0 > >>> 19527 pwd CALL > >>> getdirentries(0x4,0x801018000,0x1000,0x8010160a8) > >>> 19527 pwd RET getdirentries 4096/0x1000 > >>> 19527 pwd CALL fstatat(0x4,0x801018030,0x7fffffffd940,0x200) > >>> 19527 pwd NAMI "user6158" > >>> 19527 pwd STRU struct stat {dev=1774902232, ino=4, > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, > >>> atime=1363009687.040357529, stime=1363010116.496298252, > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, > >>> blocks=3, flags=0x0 } > >>> 19527 pwd RET fstatat 0 > >>> 19527 pwd CALL fstatat(0x4,0x80101804c,0x7fffffffd940,0x200) > >>> 19527 pwd NAMI "user2289" > >>> 19527 pwd STRU struct stat {dev=1988229825, ino=4, > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, > >>>
atime=1363009687.040357529, stime=1363010116.496298252, > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, > >>> blocks=3, flags=0x0 } > >>> 19527 pwd RET fstatat 0 > >>> 19527 pwd CALL fstatat(0x4,0x801018068,0x7fffffffd940,0x200) > >>> 19527 pwd NAMI "user4761" > >>> 19527 pwd STRU struct stat {dev=2438657130, ino=4, > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, > >>> atime=1363009687.040357529, stime=1363010116.496298252, > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, > >>> blocks=3, flags=0x0 } > >>> 19527 pwd RET fstatat 0 > >>> 19527 pwd CALL fstatat(0x4,0x801018084,0x7fffffffd940,0x200) > >>> 19527 pwd NAMI "user6055" > >>> [.........................................] > >>> > >>> and next get stat of all directories in /home > >> > >> Slightly different version of the patch was committed as r247560. > >> > >> The situation could only happen if the parent directory contains the "." > >> entry with inode number equal to the inode number of the subdirectory. > >> Can you confirm that this is your case ? > >> > > > > Yes, it is. > > I'll try again on the latest snapshot. Thanks! > > > > Yes. > On latest r248313 similar situation - if path contains "." then > nfsclient get stat of all directories in /home. What path ? Did you read the description of the situation when r248313 returns ENOENT to delegate the resolving to usermode ? Can you confirm that this is your situation ?
--IcaJnnNV5xAPxpBT Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iQIcBAEBAgAGBQJRQylVAAoJEJDCuSvBvK1BslAP/A28vT8tyv6iYpIKHO0Vthii TKDhLNiCZMw/9udsddqgJfg2IbiqtNH+L/8KqHdpk5pkF7emfolQaeINY4AdxDvo /0x6imNAGBbLkwIpyNGHJstLfqmF8FWBI9wu26wbIIV2Lv0darWkm1qzulJzC8B7 k1NIb5h2Y5SzwjG51ZOxXz5xikvpruDZnDIHKC/wnE+kqu1cy0kd/1aem8+3B5CT J3yc95R+tMiYKkNssUxPgLaobSRF//k1H4uKjtYiEB+ceXoDuwaE7cXjmqrSHpnw jgwdXZWE3b57pwv6Xs7bmAVLsWLTwqm0Qr9R7FcFcC8or9vNDz7dng6OayOS/zG/ tGYfCOldjDzznDQgF7gBDlUiqnBIy1pdwFA7asiwr89JiYv2055n67rxZGUJh590 enjhsCaKQSHONvskARmP+ETxqYHPnsUocku1ebJxkeQV0HQp/qHqnswhk3Sy4yqF gNkKLyipgjUXsV1ryre3zWi8GIaEq4LRNi5BEOkBcPsZHohwZj2N02wnaUQcr6Bi Ynup8pwerxOeBG1O9TiLMD/8VP6jM3mOD/UHtdE9cUFUACju+EMHhRtBo7yJdrXL h/4GiK9Xvf1L3mHMNJv7Yg+f3xvEcGovP3pjYhNAjyusxntT0dHaJTMLgnA+Cf0Z o5j6WYrqS/R0wjrofIOj =Bk3F -----END PGP SIGNATURE----- --IcaJnnNV5xAPxpBT-- From owner-freebsd-fs@FreeBSD.ORG Fri Mar 15 14:25:08 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 4DAD8A52 for ; Fri, 15 Mar 2013 14:25:08 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id DE5373E4 for ; Fri, 15 Mar 2013 14:25:07 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqAEADYuQ1GDaFvO/2dsb2JhbABDiDG6AIJlgXt0gioBAQUjVhsOCgICDRkCWQaIJ7A1knOBI40+NAeCLYETA5ZbiWyHFoMmIIFs X-IronPort-AV: E=Sophos;i="4.84,850,1355115600"; d="scan'208";a="19169934" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-annu.net.uoguelph.ca with ESMTP; 15 Mar 2013 10:25:00 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id B45EBB4034; Fri, 15 Mar 2013 10:25:00 -0400 (EDT) Date: Fri, 15 Mar 
2013 10:25:00 -0400 (EDT) From: Rick Macklem To: Konstantin Belousov Message-ID: <2081421885.3937873.1363357500724.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <20130315135950.GU3794@kib.kiev.ua> Subject: Re: should vn_fullpath1() ever return a path with "." in it? MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692) Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Mar 2013 14:25:08 -0000 Kostik Belousov wrote: > On Fri, Mar 15, 2013 at 05:21:18PM +0400, Ilia Noskov wrote: > > On 03/14/2013 02:10 PM, Ilia Noskov wrote: > > > On 03/14/2013 01:08 PM, Konstantin Belousov wrote: > > >> On Thu, Mar 14, 2013 at 11:29:11AM +0400, Noskov Ilia wrote: > > >>> Strange behavior on nfs-client after apply this patch: > > >>> > > >>> sysctl debug.disablecwd=0 > > >>> sysctl debug.disablefullpath=0 > > >>> > > >>> # mount -v -t nfs > > >>> 192.168.168.1:/pool on /home (nfs, noatime, nfsv4acls, fsid > > >>> 02ff003a3a000000) You don't mention how your server is configured. Does your V4: line in /etc/exports have "/" as the root or "/pool". See my comment at the bottom for why this might matter. Also, if you have a recent version of nfsstat, you can use "nfsstat -m" to dump out exactly what options the mount is actually using. (Gives a lot more info than the above.) > > >>> # ls /home | wc -l > > >>> 4946 > > >>> # cd /home/user6308/.ro Is user6308 a separate file system than /home on the server? (If so, I would expect the userland getcwd() to stat all the entries in /home to get the st_dev and st_ino fields of them all.) 
> > >>> # time pwd > > >>> /home/user6308/.ro > > >>> 0.008u 0.269s 0:08.47 3.0% 4+157k 0+0io 0pf+0w > > >>> # ktrace -t+ -i pwd > > >>> > > >>> > > >>> ktrace.out is big (1MB). Attach or not? > > >>> > > >>> > > >>> > > >>> A small piece of trace: > > >>> 19527 pwd CALL > > >>> mmap(0,0x400000,0x3,0x1002,0xffffffff,0) > > >>> > > >>> 19527 pwd RET mmap 34376515584/0x801000000 > > >>> 19527 pwd CALL __getcwd(0x801006400,0x400) > > >>> 19527 pwd NAMI ".." > > >>> 19527 pwd NAMI ".." > > >>> 19527 pwd RET __getcwd -1 errno 2 No such file or directory > > >>> 19527 pwd CALL stat(0x800947a14,0x7fffffffd940) > > >>> 19527 pwd NAMI "/" > > >>> 19527 pwd STRU struct stat {dev=98, ino=2, mode=drwxr-xr-x , > > >>> nlink=19, uid=0, gid=0, rdev=2120, atime=1363244893, > > >>> stime=1362653279, > > >>> ctime=1362653279, birthtime=1200836451, size=1024, > > >>> blksize=16384, > > >>> blocks=4, flags=0x0 } > > >>> 19527 pwd RET stat 0 > > >>> 19527 pwd CALL lstat(0x80094779c,0x7fffffffd940) > > >>> 19527 pwd NAMI "." > > >>> 19527 pwd STRU struct stat {dev=1230702064, ino=145, > > >>> mode=drwxr-xr-x , nlink=2, uid=0, gid=0, rdev=4294967295, > > >>> atime=1363244672.246785874, stime=1363244792.864201338, > > >>> ctime=1363244792.864201338, birthtime=-1, size=3, blksize=4096, > > >>> blocks=3, flags=0x0 } > > >>> 19527 pwd RET lstat 0 > > >>> 19527 pwd CALL openat(0xffffff9c,0x80094779b,0x100000,0x2) > > >>> 19527 pwd NAMI ".." 
> > >>> 19527 pwd RET openat 3 > > >>> 19527 pwd CALL fstat(0x3,0x7fffffffd880) > > >>> 19527 pwd STRU struct stat {dev=1230702064, ino=4, > > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, > > >>> atime=1363244665.232140704, stime=1363010116.496298252, > > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, > > >>> blocks=3, flags=0x0 } > > >>> 19527 pwd RET fstat 0 > > >>> 19527 pwd CALL fcntl(0x3,F_SETFD,FD_CLOEXEC) > > >>> 19527 pwd RET fcntl 0 > > >>> 19527 pwd CALL fstatfs(0x3,0x7fffffffd660) > > >>> 19527 pwd RET fstatfs 0 > > >>> 19527 pwd CALL fstat(0x3,0x7fffffffd940) > > >>> 19527 pwd STRU struct stat {dev=1230702064, ino=4, > > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, > > >>> atime=1363244665.232140704, stime=1363010116.496298252, > > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, > > >>> blocks=3, flags=0x0 } > > >>> 19527 pwd RET fstat 0 > > >>> 19527 pwd CALL > > >>> getdirentries(0x3,0x801018000,0x1000,0x8010160a8) > > >>> 19527 pwd RET getdirentries 4096/0x1000 > > >>> 19527 pwd CALL fstat(0x3,0x7fffffffd940) > > >>> 19527 pwd STRU struct stat {dev=1230702064, ino=4, > > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, > > >>> atime=1363244665.232140704, stime=1363010116.496298252, > > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, > > >>> blocks=3, flags=0x0 } > > >>> 19527 pwd RET fstat 0 > > >>> 19527 pwd CALL openat(0x3,0x80094779b,0x100000,0) > > >>> 19527 pwd NAMI ".." > > >>> 19527 pwd RET openat 4 > > >>> [..............................] 
> > >>> 19527 pwd CALL madvise(0x801016000,0x1000,MADV_FREE) > > >>> 19527 pwd RET madvise 0 > > >>> 19527 pwd CALL madvise(0x801018000,0x2000,MADV_FREE) > > >>> 19527 pwd RET madvise 0 > > >>> 19527 pwd CALL close(0x3) > > >>> 19527 pwd RET close 0 > > >>> 19527 pwd CALL fstat(0x4,0x7fffffffd880) > > >>> 19527 pwd STRU struct stat {dev=973143810, ino=4, > > >>> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295, > > >>> atime=1363244767.460164771, stime=1363172100.380266923, > > >>> ctime=1363172100.380266923, birthtime=-1, size=4948, > > >>> blksize=4096, > > >>> blocks=713, flags=0x0 } > > >>> 19527 pwd RET fstat 0 > > >>> 19527 pwd CALL fcntl(0x4,F_SETFD,FD_CLOEXEC) > > >>> 19527 pwd RET fcntl 0 > > >>> 19527 pwd CALL fstatfs(0x4,0x7fffffffd660) > > >>> 19527 pwd RET fstatfs 0 > > >>> 19527 pwd CALL fstat(0x4,0x7fffffffd940) > > >>> 19527 pwd STRU struct stat {dev=973143810, ino=4, > > >>> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295, > > >>> atime=1363244767.460164771, stime=1363172100.380266923, > > >>> ctime=1363172100.380266923, birthtime=-1, size=4948, > > >>> blksize=4096, > > >>> blocks=713, flags=0x0 } > > >>> 19527 pwd RET fstat 0 > > >>> 19527 pwd CALL > > >>> getdirentries(0x4,0x801018000,0x1000,0x8010160a8) > > >>> 19527 pwd RET getdirentries 4096/0x1000 > > >>> 19527 pwd CALL fstatat(0x4,0x801018030,0x7fffffffd940,0x200) > > >>> 19527 pwd NAMI "user6158" > > >>> 19527 pwd STRU struct stat {dev=1774902232, ino=4, > > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, > > >>> atime=1363009687.040357529, stime=1363010116.496298252, > > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, > > >>> blocks=3, flags=0x0 } > > >>> 19527 pwd RET fstatat 0 > > >>> 19527 pwd CALL fstatat(0x4,0x80101804c,0x7fffffffd940,0x200) > > >>> 19527 pwd NAMI "user2289" > > >>> 19527 pwd STRU struct stat {dev=1988229825, ino=4, > > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, > > >>> 
atime=1363009687.040357529, stime=1363010116.496298252, > > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, > > >>> blocks=3, flags=0x0 } > > >>> 19527 pwd RET fstatat 0 > > >>> 19527 pwd CALL fstatat(0x4,0x801018068,0x7fffffffd940,0x200) > > >>> 19527 pwd NAMI "user4761" > > >>> 19527 pwd STRU struct stat {dev=2438657130, ino=4, > > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, > > >>> atime=1363009687.040357529, stime=1363010116.496298252, > > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, > > >>> blocks=3, flags=0x0 } > > >>> 19527 pwd RET fstatat 0 > > >>> 19527 pwd CALL fstatat(0x4,0x801018084,0x7fffffffd940,0x200) > > >>> 19527 pwd NAMI "user6055" > > >>> [.........................................] > > >>> > > >>> and next get stat of all directories in /home > > >> > > >> Slightly different version of the patch was committed as r247560. > > >> > > >> The situation could only happen if the parent directory contains > > >> the "." > > >> entry with inode number equal to the inode number of the > > >> subdirectory. > > >> Can you confirm that this is your case ? > > >> > > > > > > Yes, it is. > > > I'll try again on the latest snapshot. Thanks! > > > > > > > Yes. > > On latest r248313 similar situation - if path contains "." then > > nfsclient get stat of all directories in /home. > > What path ? > > Did you read the description of the situation when r248313 returns > ENOENT to delegate the resolving to usermode ? Can you confirm that > this is your situation ? I think the patch is doing the correct thing. When __getcwd() returns ENOENT, the userland algorithm in getcwd() must stat() all entries in the directory to figure out if it is a mount point using the st_ino and st_dev fields. (You can look at it in the libc sources, if you'd like. Just look for getcwd.c.) I think this is what Kostik is referring to. If I understand your issue, it is that this takes a long time, since /home is large. 
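As a rough illustration of the directory scan described above, here is a simplified sketch in C. It is not the actual libc getcwd.c code: find_name_in_parent() is a hypothetical helper name, and the sketch only shows the idea of matching st_dev and st_ino one entry at a time. The flag AT_SYMLINK_NOFOLLOW is an assumption taken from the 0x200 argument in the fstatat() calls in the trace.

```c
#define _POSIX_C_SOURCE 200809L

#include <sys/stat.h>

#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not the libc implementation): scan parent_path for
 * the entry whose st_dev/st_ino pair matches *target.  Returns 0 and copies
 * the entry name into name on success, -1 if nothing matched.
 */
static int
find_name_in_parent(const char *parent_path, const struct stat *target,
    char *name, size_t namelen)
{
	DIR *dirp;
	struct dirent *dp;
	struct stat sb;
	int found = -1;

	dirp = opendir(parent_path);
	if (dirp == NULL)
		return (-1);
	while ((dp = readdir(dirp)) != NULL) {
		if (strcmp(dp->d_name, ".") == 0 ||
		    strcmp(dp->d_name, "..") == 0)
			continue;
		/* One stat per entry; over NFS each may cost a round trip. */
		if (fstatat(dirfd(dirp), dp->d_name, &sb,
		    AT_SYMLINK_NOFOLLOW) != 0)
			continue;
		if (sb.st_dev == target->st_dev &&
		    sb.st_ino == target->st_ino) {
			snprintf(name, namelen, "%s", dp->d_name);
			found = 0;
			break;
		}
	}
	closedir(dirp);
	return (found);
}
```

A caller would lstat(".") and then call find_name_in_parent("..", &sb, buf, sizeof(buf)) to recover the last path component. With roughly 4,900 entries in /home, the loop may have to stat a large share of them before it finds a match, which is consistent with the long run of fstatat() calls in the trace.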
There are a couple of things that *might* reduce the time this takes.
1 - If you are NFSv4 mounting "/pool", you could specify /pool as the root in your V4: /etc/exports line. Something like:
V4: /pool ...
Then you do the mount with 192.168.168.1:/. This will make the "/pool" a root point and might avoid the problem, but I am not sure.
2 - Adding rdirplus to the mount options will make it get attributes for entries in a directory when it does a readdir and cache them. This might speed things up.
rick
From owner-freebsd-fs@FreeBSD.ORG Fri Mar 15 14:37:33 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 2EED2CCB for ; Fri, 15 Mar 2013 14:37:33 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id D66D463B for ; Fri, 15 Mar 2013 14:37:32 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqAEAM8xQ1GDaFvO/2dsb2JhbABDiDG6AIJlgXt0gioBAQUjVhsOCgICDRkCWQaIJ7AfknSBI40+NAeCLYETA5ZbiWyHFoMmIIFs X-IronPort-AV: E=Sophos;i="4.84,850,1355115600"; d="scan'208";a="21391005" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 15 Mar 2013 10:37:31 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id BBBABB4036; Fri, 15 Mar 2013 10:37:31 -0400 (EDT) Date: Fri, 15 Mar 2013 10:37:31 -0400 (EDT) From: Rick Macklem To: Konstantin Belousov Message-ID: <800896676.3938590.1363358251751.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <20130315135950.GU3794@kib.kiev.ua> Subject: Re: should vn_fullpath1() ever return a path with "." in it?
MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692) Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Mar 2013 14:37:33 -0000 Kostik Belousov wrote: > On Fri, Mar 15, 2013 at 05:21:18PM +0400, Ilia Noskov wrote: > > On 03/14/2013 02:10 PM, Ilia Noskov wrote: > > > On 03/14/2013 01:08 PM, Konstantin Belousov wrote: > > >> On Thu, Mar 14, 2013 at 11:29:11AM +0400, Noskov Ilia wrote: > > >>> Strange behavior on nfs-client after apply this patch: > > >>> > > >>> sysctl debug.disablecwd=0 > > >>> sysctl debug.disablefullpath=0 > > >>> > > >>> # mount -v -t nfs > > >>> 192.168.168.1:/pool on /home (nfs, noatime, nfsv4acls, fsid > > >>> 02ff003a3a000000) > > >>> # ls /home | wc -l > > >>> 4946 > > >>> # cd /home/user6308/.ro I forgot to mention in the previous post. If user6308 is a different file system than /home, forget about my suggestion #1, because I know it won't help for this case. > > >>> # time pwd > > >>> /home/user6308/.ro > > >>> 0.008u 0.269s 0:08.47 3.0% 4+157k 0+0io 0pf+0w > > >>> # ktrace -t+ -i pwd > > >>> > > >>> > > >>> ktrace.out is big (1MB). Attach or not? > > >>> > > >>> > > >>> > > >>> A small piece of trace: > > >>> 19527 pwd CALL > > >>> mmap(0,0x400000,0x3,0x1002,0xffffffff,0) > > >>> > > >>> 19527 pwd RET mmap 34376515584/0x801000000 > > >>> 19527 pwd CALL __getcwd(0x801006400,0x400) > > >>> 19527 pwd NAMI ".." > > >>> 19527 pwd NAMI ".." 
> > >>> 19527 pwd RET __getcwd -1 errno 2 No such file or directory > > >>> 19527 pwd CALL stat(0x800947a14,0x7fffffffd940) > > >>> 19527 pwd NAMI "/" > > >>> 19527 pwd STRU struct stat {dev=98, ino=2, mode=drwxr-xr-x , > > >>> nlink=19, uid=0, gid=0, rdev=2120, atime=1363244893, > > >>> stime=1362653279, > > >>> ctime=1362653279, birthtime=1200836451, size=1024, > > >>> blksize=16384, > > >>> blocks=4, flags=0x0 } > > >>> 19527 pwd RET stat 0 > > >>> 19527 pwd CALL lstat(0x80094779c,0x7fffffffd940) > > >>> 19527 pwd NAMI "." > > >>> 19527 pwd STRU struct stat {dev=1230702064, ino=145, > > >>> mode=drwxr-xr-x , nlink=2, uid=0, gid=0, rdev=4294967295, > > >>> atime=1363244672.246785874, stime=1363244792.864201338, > > >>> ctime=1363244792.864201338, birthtime=-1, size=3, blksize=4096, > > >>> blocks=3, flags=0x0 } > > >>> 19527 pwd RET lstat 0 > > >>> 19527 pwd CALL openat(0xffffff9c,0x80094779b,0x100000,0x2) > > >>> 19527 pwd NAMI ".." > > >>> 19527 pwd RET openat 3 > > >>> 19527 pwd CALL fstat(0x3,0x7fffffffd880) > > >>> 19527 pwd STRU struct stat {dev=1230702064, ino=4, > > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, > > >>> atime=1363244665.232140704, stime=1363010116.496298252, > > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, > > >>> blocks=3, flags=0x0 } > > >>> 19527 pwd RET fstat 0 > > >>> 19527 pwd CALL fcntl(0x3,F_SETFD,FD_CLOEXEC) > > >>> 19527 pwd RET fcntl 0 > > >>> 19527 pwd CALL fstatfs(0x3,0x7fffffffd660) > > >>> 19527 pwd RET fstatfs 0 > > >>> 19527 pwd CALL fstat(0x3,0x7fffffffd940) > > >>> 19527 pwd STRU struct stat {dev=1230702064, ino=4, > > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, > > >>> atime=1363244665.232140704, stime=1363010116.496298252, > > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, > > >>> blocks=3, flags=0x0 } > > >>> 19527 pwd RET fstat 0 > > >>> 19527 pwd CALL > > >>> getdirentries(0x3,0x801018000,0x1000,0x8010160a8) > > >>> 19527 pwd RET 
getdirentries 4096/0x1000 > > >>> 19527 pwd CALL fstat(0x3,0x7fffffffd940) > > >>> 19527 pwd STRU struct stat {dev=1230702064, ino=4, > > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, > > >>> atime=1363244665.232140704, stime=1363010116.496298252, > > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, > > >>> blocks=3, flags=0x0 } > > >>> 19527 pwd RET fstat 0 > > >>> 19527 pwd CALL openat(0x3,0x80094779b,0x100000,0) > > >>> 19527 pwd NAMI ".." > > >>> 19527 pwd RET openat 4 > > >>> [..............................] > > >>> 19527 pwd CALL madvise(0x801016000,0x1000,MADV_FREE) > > >>> 19527 pwd RET madvise 0 > > >>> 19527 pwd CALL madvise(0x801018000,0x2000,MADV_FREE) > > >>> 19527 pwd RET madvise 0 > > >>> 19527 pwd CALL close(0x3) > > >>> 19527 pwd RET close 0 > > >>> 19527 pwd CALL fstat(0x4,0x7fffffffd880) > > >>> 19527 pwd STRU struct stat {dev=973143810, ino=4, > > >>> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295, > > >>> atime=1363244767.460164771, stime=1363172100.380266923, > > >>> ctime=1363172100.380266923, birthtime=-1, size=4948, > > >>> blksize=4096, > > >>> blocks=713, flags=0x0 } > > >>> 19527 pwd RET fstat 0 > > >>> 19527 pwd CALL fcntl(0x4,F_SETFD,FD_CLOEXEC) > > >>> 19527 pwd RET fcntl 0 > > >>> 19527 pwd CALL fstatfs(0x4,0x7fffffffd660) > > >>> 19527 pwd RET fstatfs 0 > > >>> 19527 pwd CALL fstat(0x4,0x7fffffffd940) > > >>> 19527 pwd STRU struct stat {dev=973143810, ino=4, > > >>> mode=drwxr-xr-x , nlink=4948, uid=0, gid=0, rdev=4294967295, > > >>> atime=1363244767.460164771, stime=1363172100.380266923, > > >>> ctime=1363172100.380266923, birthtime=-1, size=4948, > > >>> blksize=4096, > > >>> blocks=713, flags=0x0 } > > >>> 19527 pwd RET fstat 0 > > >>> 19527 pwd CALL > > >>> getdirentries(0x4,0x801018000,0x1000,0x8010160a8) > > >>> 19527 pwd RET getdirentries 4096/0x1000 > > >>> 19527 pwd CALL fstatat(0x4,0x801018030,0x7fffffffd940,0x200) > > >>> 19527 pwd NAMI "user6158" > > >>> 19527 pwd STRU 
struct stat {dev=1774902232, ino=4, > > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, > > >>> atime=1363009687.040357529, stime=1363010116.496298252, > > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, > > >>> blocks=3, flags=0x0 } > > >>> 19527 pwd RET fstatat 0 > > >>> 19527 pwd CALL fstatat(0x4,0x80101804c,0x7fffffffd940,0x200) > > >>> 19527 pwd NAMI "user2289" > > >>> 19527 pwd STRU struct stat {dev=1988229825, ino=4, > > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, > > >>> atime=1363009687.040357529, stime=1363010116.496298252, > > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, > > >>> blocks=3, flags=0x0 } > > >>> 19527 pwd RET fstatat 0 > > >>> 19527 pwd CALL fstatat(0x4,0x801018068,0x7fffffffd940,0x200) > > >>> 19527 pwd NAMI "user4761" > > >>> 19527 pwd STRU struct stat {dev=2438657130, ino=4, > > >>> mode=drwxr-xr-x , nlink=9, uid=0, gid=0, rdev=4294967295, > > >>> atime=1363009687.040357529, stime=1363010116.496298252, > > >>> ctime=1363010116.496298252, birthtime=-1, size=14, blksize=4096, > > >>> blocks=3, flags=0x0 } > > >>> 19527 pwd RET fstatat 0 > > >>> 19527 pwd CALL fstatat(0x4,0x801018084,0x7fffffffd940,0x200) > > >>> 19527 pwd NAMI "user6055" > > >>> [.........................................] > > >>> > > >>> and next get stat of all directories in /home > > >> > > >> Slightly different version of the patch was committed as r247560. > > >> > > >> The situation could only happen if the parent directory contains > > >> the "." > > >> entry with inode number equal to the inode number of the > > >> subdirectory. > > >> Can you confirm that this is your case ? > > >> > > > > > > Yes, it is. > > > I'll try again on the latest snapshot. Thanks! > > > > > > > Yes. > > On latest r248313 similar situation - if path contains "." then > > nfsclient get stat of all directories in /home. > > What path ? 
> > Did you read the description of the situation when r248313 returns > ENOENT to delegate the resolving to usermode ? Can you confirm that > this is your situation ? From owner-freebsd-fs@FreeBSD.ORG Fri Mar 15 14:58:58 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 128DF921; Fri, 15 Mar 2013 14:58:58 +0000 (UTC) (envelope-from fjwcash@gmail.com) Received: from mail-qa0-f48.google.com (mail-qa0-f48.google.com [209.85.216.48]) by mx1.freebsd.org (Postfix) with ESMTP id B4C0275D; Fri, 15 Mar 2013 14:58:57 +0000 (UTC) Received: by mail-qa0-f48.google.com with SMTP id j8so336636qah.0 for ; Fri, 15 Mar 2013 07:58:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=qkDuTVQJ6xqFXDMQGdzSS6hRckvP99eSBCbNyltoeoc=; b=WlOM50D7ZgsjK2u0UpZOHzNPDZutSZNir/iL8rVtDFYeBkMrCLmNDG89kOpMSc2BLt 0VA+gQ7wEInUZdR1RvEs3OoOlxhOhdQXnAIpk0baTWSDrvneMMOENENvCdA4cqkLl+a0 g8JxeuIYseV0CvtNyPsCTfqmmj1R6t1weFhtvMi6RH6XBrFBIo2uKLMDaJxrQmMgJ+4q u0C5zAo7athMplD69T3rHZTa7IxgwgYEQX3/ypbX1FZtPv5IKBEhK1oCrjySiV000Sez b332v3Gvgh0rseheluJvy0uc1pZP0jvdn1MCZ4gUibC+m5FzP04q0lALsoa7lta0o0y1 CWVA== MIME-Version: 1.0 X-Received: by 10.49.128.170 with SMTP id np10mr6042926qeb.37.1363359531070; Fri, 15 Mar 2013 07:58:51 -0700 (PDT) Received: by 10.49.50.67 with HTTP; Fri, 15 Mar 2013 07:58:50 -0700 (PDT) In-Reply-To: <51430744.6020004@FreeBSD.org> References: <51430744.6020004@FreeBSD.org> Date: Fri, 15 Mar 2013 07:58:50 -0700 Message-ID: Subject: Re: Strange slowdown when cache devices enabled in ZFS From: Freddie Cash To: Andriy Gapon Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems 
List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Mar 2013 14:58:58 -0000 How does one do that? I've never done that before. Point me to some docs, and I'll see what I can find out. On Fri, Mar 15, 2013 at 4:34 AM, Andriy Gapon wrote: > on 14/03/2013 20:13 Freddie Cash said the following: > > the l2arc_feed_thread of zfskern will spin until it takes up 100% > > of a CPU core > > If you see a thread taking 100% where it shouldn't, then just profile it > and > actually see what it's doing. > > -- > Andriy Gapon > -- Freddie Cash fjwcash@gmail.com From owner-freebsd-fs@FreeBSD.ORG Fri Mar 15 21:27:34 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 69F416CF for ; Fri, 15 Mar 2013 21:27:34 +0000 (UTC) (envelope-from girgen@FreeBSD.org) Received: from melon.pingpong.net (melon.pingpong.net [79.136.116.200]) by mx1.freebsd.org (Postfix) with ESMTP id A0BB56D2 for ; Fri, 15 Mar 2013 21:27:33 +0000 (UTC) Received: from girgBook.local (c-1754e155.1525-1-64736c12.cust.bredbandsbolaget.se [85.225.84.23]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by melon.pingpong.net (Postfix) with ESMTPSA id C29BE14732; Fri, 15 Mar 2013 22:27:31 +0100 (CET) Message-ID: <51439243.5020604@FreeBSD.org> Date: Fri, 15 Mar 2013 22:27:31 +0100 From: Palle Girgensohn User-Agent: Postbox 3.0.7 (Macintosh/20130119) MIME-Version: 1.0 To: Kirk McKusick Subject: Re: leaking lots of unreferenced inodes (pg_xlog files?), maybe after moving tables and indexes to tablespace on different volume References: <201303131652.r2DGqSr4051899@chez.mckusick.com> In-Reply-To: <201303131652.r2DGqSr4051899@chez.mckusick.com> X-Enigmail-Version: 1.2.3 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Cc: freebsd-fs@freebsd.org, Jeff Roberson X-BeenThere: 
freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Mar 2013 21:27:34 -0000 -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Kirk McKusick skrev: > Thanks for your report. It is certainly unlike anything that we have > seen reported before. > > Are you running your /usr filesystem with (the default) journalled > soft updates? You can check this by running the `mount' command with > no arguments. > > Rather than rebooting your system, it would be most helpful if you > could instead shut it down to single user. Then do the following: > > Create a transcript of your session by running `script'. Once running > in the session run these commands: > > Run `mount' to show your filesystem configuration. Run `df -hi /usr' > to see whether the inodes are still missing. Verify that you can > cleanly unmount /usr (e.g., that the unmount does not hang and does > not complain). Remount /usr and run `df -hi' to see whether the > inodes are still missing. Unmount /usr again and run `fsck_ffs -p -f > -d /usr'. If the fsck_ffs fails with an unexpected inconsistency, you > can run `fsck_ffs -y -d /usr' to force it to clean up. When you have > the filesystem successfully cleaned up, type `exit' to get out of the > script session and mail me the transcript of the session > (typescript). > > Thanks for your help in tracking this down. > > Kirk McKusick Hi again, A umount + mount was enough to reclaim the space. 
Script started on Fri Mar 15 19:02:22 2013
# mount
/dev/da0s1a on / (ufs, local)
devfs on /dev (devfs, local, multilabel)
/dev/da0s1d on /tmp (ufs, local, soft-updates)
/dev/da0s1f on /usr (ufs, local, soft-updates)
/dev/da0s1e on /var (ufs, local, soft-updates)
/dev/da1s1d on /opt (ufs, local, soft-updates)
procfs on /proc (procfs, local)
fdescfs on /dev/fd (fdescfs)
# df -hi /usr
Filesystem     Size    Used   Avail Capacity iused ifree %iused  Mounted on
/dev/da0s1f    104G     88G    7.5G    92%    283k   13M    2%   /usr
# umount /usr
# mount /usr
# df -hi /usr
Filesystem     Size    Used   Avail Capacity iused ifree %iused  Mounted on
/dev/da0s1f    104G    4.7G     91G     5%    278k   13M    2%   /usr
# ^D
Script done on Fri Mar 15 19:09:26 2013

But after a couple of hours in production again, following a power-off
and reboot (for other reasons: the remote console card, the iLO, had to
be replaced), an fsck indicates that it might still be losing file
references occasionally. Look at the unreferenced files with size
~ 11111686. This is exactly how it looked before the unmount/remount,
only there were many, many more.

# fsck /usr
** /dev/da0s1f (NO WRITE)
** Last Mounted on /usr
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
UNALLOCATED I=3157519 OWNER=pgsql MODE=100600
SIZE=0 MTIME=Mar 15 22:15 2013
FILE=/local/pgsql/data/base/16431/t79_3703656628

UNEXPECTED SOFT UPDATE INCONSISTENCY

REMOVE? no

UNALLOCATED I=7301569 OWNER=pgsql MODE=100600
SIZE=0 MTIME=Mar 15 22:15 2013
FILE=/local/pgsql/data/base/2969955511/t109_3703656671

UNEXPECTED SOFT UPDATE INCONSISTENCY

REMOVE? no

** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
UNREF FILE I=3156555 OWNER=pgsql MODE=100600
SIZE=11100363 MTIME=Mar 15 20:42 2013
CLEAR? no

UNREF FILE I=3156714 OWNER=pgsql MODE=100600
SIZE=11126220 MTIME=Mar 15 20:08 2013
CLEAR? no

UNREF FILE I=3157415 OWNER=pgsql MODE=100600
SIZE=11101546 MTIME=Mar 15 20:34 2013
CLEAR? no

UNREF FILE I=3157512 OWNER=pgsql MODE=100600
SIZE=11100870 MTIME=Mar 15 20:39 2013
CLEAR? no

UNREF FILE I=3157518 OWNER=pgsql MODE=100600
SIZE=11100194 MTIME=Mar 15 20:59 2013
CLEAR? no

UNREF FILE I=3157520 OWNER=pgsql MODE=100600
SIZE=11098673 MTIME=Mar 15 21:58 2013
CLEAR? no

UNREF FILE I=3157544 OWNER=pgsql MODE=100600
SIZE=11107123 MTIME=Mar 15 21:13 2013
CLEAR? no

UNREF FILE I=3157547 OWNER=pgsql MODE=100600
SIZE=11110672 MTIME=Mar 15 21:54 2013
CLEAR? no

UNREF FILE I=3157554 OWNER=pgsql MODE=100600
SIZE=11111686 MTIME=Mar 15 22:12 2013
CLEAR? no

LINK COUNT FILE I=3157590 OWNER=pgsql MODE=0
SIZE=0 MTIME=Mar 15 22:15 2013 COUNT 0 SHOULD BE -1
ADJUST? no

UNREF FILE I=3157596 OWNER=pgsql MODE=100600
SIZE=11107968 MTIME=Mar 15 20:48 2013
CLEAR? no

UNREF FILE I=3157607 OWNER=pgsql MODE=100600
SIZE=11093096 MTIME=Mar 15 21:23 2013
CLEAR? no

LINK COUNT FILE I=7301564 OWNER=pgsql MODE=0
SIZE=0 MTIME=Mar 15 22:15 2013 COUNT 0 SHOULD BE -2
ADJUST? no

UNREF FILE I=8485378 OWNER=pgsql MODE=100600
SIZE=0 MTIME=Mar 15 22:15 2013
RECONNECT? no

CLEAR? no

** Phase 5 - Check Cyl groups
SUMMARY INFORMATION BAD
SALVAGE? no

ALLOCATED FRAGS 3416608-3416735 MARKED FREE
BLK(S) MISSING IN BIT MAPS
SALVAGE? no

FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? no

278680 files, 2516552 used, 52158314 free (72842 frags, 6510684 blocks, 0.1% fragmentation)

FreeBSD 9.1-RELEASE, amd64, GENERIC kernel.

Any ideas?

Best regards,
Palle

>
> ----- Original Message:
>
> Date: Wed, 13 Mar 2013 11:23:13 +0100
> From: Palle Girgensohn
> To: freebsd-fs@freebsd.org
> Subject: leaking lots of unreferenced inodes (pg_xlog files?), maybe
>  after moving tables and indexes to tablespace on different volume
>
> Hi!
>
> Running postgresql-9.2.2 on FreeBSD 9.1 amd64 using vanilla ufs file
> system.
>
> I have the postgresql base/ on the /usr disk, and a separate volume
> /opt where the default tablespace resides. This means that the amount
> of data on the /usr disk should be stable. This is not the case: the
> disk usage grows linearly (it seems to leave many inodes
> unreferenced).
>
> The discrepancy between df and du is now huge:
>
> # du -sxh /usr; df -h /usr
> 4,6G    /usr
> Filesystem     Size    Used   Avail Capacity  Mounted on
> /dev/da0s1f    104G     88G    8.0G    92%    /usr
>
> 4,6G vs 88GB, that must be more than a rounding error?
>
> Strange thing is I cannot find any open files among the missing.
>
> # lsof /usr| awk '{print $9}'|xargs ls -l > /dev/null
>
> returns no errors (a missing file would render an error with ls). If
> there were open files not referenced in any directory, they should
> be found.
>
> Next thing is fsck, and yes, there are plenty of unreferenced files.
>
> I ran fsck while the system was running (i.e. read only) to get a
> grip on the amount of lost inodes:
>
> fsck /usr | awk '{print $1}'|cut -f 2 -d=| perl -e '$i = 0; while (<>) { $i += $_;}; print $i / 1024 / 1024; print "\n";'
> 85223.3530330658
>
> ~85 GB gone, that's 80% of the disk, and it accounts for all the
> missing space.
>
> MTIME for the inodes is pretty evenly spread over time since the
> machine was updated to FreeBSD 9.1, rebooted, and PostgreSQL was
> updated to 9.2. All was done at the same time, so I can't really tell
> what is to blame, but this is the only server, out of a dozen that
> were updated to exactly the same versions, that has this problem.
> All other servers have their /usr disk usage stable (since all data
> resides on a separate tablespace).
>
> The unreferenced inodes are almost exclusively around 16 MB in size,
> so they most certainly all are postgresql pg_xlog files. This means
> all files are lost from the same portion of code in the database
> engine.
>
> How could it possibly be able to leave unreferenced inodes around
> like this at such a scale? Is the culprit a combination of postgresql
> and file system code? Both were updated.
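For reference, the total computed by the awk/cut/perl pipeline above can also be obtained with a short Python sketch. It assumes the fsck_ffs record layout quoted in this thread (`UNREF FILE I=... OWNER=... MODE=... SIZE=<bytes>`); the name `unref_bytes` is made up for illustration and is not part of any FreeBSD tool. Unlike the pipeline, it counts only UNREF FILE records, so UNALLOCATED and LINK COUNT entries are ignored.

```python
import re

# Assumption: records look like the fsck_ffs output quoted in this thread, e.g.
#   UNREF FILE I=3156555 OWNER=pgsql MODE=100600 SIZE=11100363 MTIME=...
# \s+ also matches newlines, so the fields may be wrapped across lines.
UNREF_RE = re.compile(
    r"UNREF FILE\s+I=\d+\s+OWNER=\S+\s+MODE=\d+\s+SIZE=(\d+)"
)


def unref_bytes(fsck_output):
    """Return (count, total_bytes) over all UNREF FILE records."""
    sizes = [int(size) for size in UNREF_RE.findall(fsck_output)]
    return len(sizes), sum(sizes)


# Two records copied from the fsck run shown earlier in this message.
sample = (
    "UNREF FILE I=3156555 OWNER=pgsql MODE=100600\n"
    "SIZE=11100363 MTIME=Mar 15 20:42 2013\n"
    "CLEAR? no\n\n"
    "UNREF FILE I=3156714 OWNER=pgsql MODE=100600\n"
    "SIZE=11126220 MTIME=Mar 15 20:08 2013\n"
    "CLEAR? no\n"
)
count, total = unref_bytes(sample)
print(count, total, f"{total / 1024 / 1024:.1f} MiB")
```

Feeding it the saved output of a read-only `fsck /usr` run would give the number of unreferenced files alongside the byte total, which makes it easier to compare runs taken before and after a remount.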
>
> pg_xlog checkpoints seem to happen approximately every three
> minutes:
>
> Mar 13 00:39:08 dbserver postgres[5298]: [48-1] db=,user= LOG: checkpoint starting: time
> Mar 13 00:41:38 dbserver postgres[5298]: [49-1] db=,user= LOG: checkpoint complete: wrote 2542 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=149.667 s, sync=0.101 s, total=149.770 s; sync files=628, longest=0.021 s, average=0.000 s
> Mar 13 00:44:08 dbserver postgres[5298]: [50-1] db=,user= LOG: checkpoint starting: time
> Mar 13 00:46:38 dbserver postgres[5298]: [51-1] db=,user= LOG: checkpoint complete: wrote 3996 buffers (0.4%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=149.438 s, sync=0.111 s, total=149.551 s; sync files=823, longest=0.006 s, average=0.000 s
> Mar 13 00:49:08 dbserver postgres[5298]: [52-1] db=,user= LOG: checkpoint starting: time
> Mar 13 00:51:38 dbserver postgres[5298]: [53-1] db=,user= LOG: checkpoint complete: wrote 13736 buffers (1.4%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=149.958 s, sync=0.311 s, total=150.271 s; sync files=1335, longest=0.079 s, average=0.000 s
> Mar 13 00:54:08 dbserver postgres[5298]: [54-1] db=,user= LOG: checkpoint starting: time
> Mar 13 00:56:38 dbserver postgres[5298]: [55-1] db=,user= LOG: checkpoint complete: wrote 14638 buffers (1.5%); 0 transaction log file(s) added, 0 removed, 17 recycled; write=149.330 s, sync=0.271 s, total=149.603 s; sync files=1363, longest=0.017 s, average=0.000 s
> Mar 13 00:59:08 dbserver postgres[5298]: [56-1] db=,user= LOG: checkpoint starting: time
> Mar 13 01:01:38 dbserver postgres[5298]: [57-1] db=,user= LOG: checkpoint complete: wrote 8035 buffers (0.8%); 0 transaction log file(s) added, 0 removed, 21 recycled; write=149.285 s, sync=0.146 s, total=149.433 s; sync files=1160, longest=0.003 s, average=0.000 s
> Mar 13 01:04:08 dbserver postgres[5298]: [58-1] db=,user= LOG: checkpoint starting: time
> Mar 13 01:06:37 dbserver postgres[5298]: [59-1] db=,user= LOG: checkpoint complete: wrote 2156 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=149.402 s, sync=0.057 s, total=149.461 s; sync files=610, longest=0.000 s, average=0.000 s
> Mar 13 01:09:08 dbserver postgres[5298]: [60-1] db=,user= LOG: checkpoint starting: time
>
>
> I'm pretty certain that unmounting the file system and running fsck
> will regain the lost space, but will it stop there?
>
> Stopping postgresql briefly did not help, I tried that. That would
> have helped if the files were open, but they're not. It seems
> postgresql did the right thing, and FreeBSD failed to unreference the
> files.
>
> The server has about 30 databases and ~127 concurrent connections
> (not all being active simultaneously, though), so it is fair to say
> it is pretty active, but nothing extreme.
>
> Hardware is HP DL360, using their HT Smart Array P410i.
>
> Any ideas how to debug this? Or shall I just reboot, fsck, hope the
> problem will go away, and when it does, forget about it?
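The checkpoint cadence can be read directly off the timestamps in the log quoted earlier in this message. The following is a minimal Python sketch, assuming one syslog record per line in the format shown there and the year 2013 (syslog timestamps omit the year); `checkpoint_start_intervals` is an illustrative name, not an existing tool. In the quoted records, consecutive "checkpoint starting" entries are 300 seconds apart, and each write phase takes about 150 seconds.

```python
import re
from datetime import datetime

# Assumption: one syslog record per line, in the format quoted in this thread.
# No line-start anchor, so a leading "> " quote prefix is tolerated.
STARTING_RE = re.compile(
    r"([A-Z][a-z]{2} +\d{1,2} \d\d:\d\d:\d\d) [^\n]*checkpoint starting"
)


def checkpoint_start_intervals(log_text, year=2013):
    """Return the seconds between consecutive 'checkpoint starting' records."""
    starts = [
        # The year is prepended because syslog timestamps do not carry one.
        datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        for stamp in STARTING_RE.findall(log_text)
    ]
    return [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]


# Records taken from the log above; the "complete" line is abbreviated.
sample = """\
Mar 13 00:39:08 dbserver postgres[5298]: [48-1] db=,user= LOG: checkpoint starting: time
Mar 13 00:41:38 dbserver postgres[5298]: [49-1] db=,user= LOG: checkpoint complete: wrote 2542 buffers (0.3%)
Mar 13 00:44:08 dbserver postgres[5298]: [50-1] db=,user= LOG: checkpoint starting: time
Mar 13 00:49:08 dbserver postgres[5298]: [52-1] db=,user= LOG: checkpoint starting: time
"""
print(checkpoint_start_intervals(sample))  # [300.0, 300.0]
```

Comparing these intervals with the MTIMEs of the unreferenced files from fsck may help show whether new orphans appear in step with checkpoints.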
> > Thanks, Palle -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 > v2.0.17 (Darwin) Comment: GPGTools - http://gpgtools.org Comment: > Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iQEcBAEBAgAGBQJRQFORAAoJEIhV+7FrxBJDzVUIAJHU011JDxLxj8/xg05Gwhgq > XK3xB+0N0NSUQ50yhcRKLINz/j/XfeS0ZxlH+MstaPA9y0r1JUXMxkb/uTUvGBiy > jutk3eVe0cati9cVZbJkRU5FxEgmQ0fg0GOMl3RQAErkh5achj+klWvN7PnwGjTs > O3L9RgckKuxTJffk52GAS05qY/TKR6f08kdX3I2cFtqw3tyTyrXU0JPdk2snuPhv > H40xV46zgtWMFDvZLt61MryQ7/JotVQwU78scUB+zxrf8KKM9V0mM7pk0pIbG4Qw > NJBpZJ5gjbl4x+dkQrtZdL65yq88hACYwo9D+83Ct4ig8tgcQ7ViNHWxJqknK7Q= > =3ZZs -----END PGP SIGNATURE----- > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs To unsubscribe, > send any mail to "freebsd-fs-unsubscribe@freebsd.org" -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.17 (Darwin) Comment: GPGTools - http://gpgtools.org Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQEcBAEBAgAGBQJRQ5JDAAoJEIhV+7FrxBJDuK8H/3gvtaZyKqNxbrQ+JkgGooit AVs5i38j6ZjKoYOPTNrqD5zsqk76NE5hUmJ2HAj/EkEt5CnPkR0trVN/s95NQu1S IY+iOlng9ImKHVvIEWKRap0WTeUu7BT2M+e6szOkOOo93xqS7E0U7tfwgkFXgjI2 MUcy7QxFz/Yfjyu7HrYDvJMCmCEL2e5SDRQoPXO/Qs4CRnE16d85nJtFJXuM8EgQ j8ZZmmphRt9yxxLg6tAlm3Tscf2QqXL8G4ABHSf32dJYuO11/7Glz+svh4m/gj7B YnlXuqOq7ESBMhwLpQqA78JOWfZiiF8B8aTQVlxm3GtjPWknm4rkK1XljWl8Zi8= =kIKS -----END PGP SIGNATURE----- From owner-freebsd-fs@FreeBSD.ORG Sat Mar 16 02:03:41 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 24892205; Sat, 16 Mar 2013 02:03:41 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id 8E80DA50; Sat, 16 Mar 2013 02:03:40 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: 
AqAEAFrSQ1GDaFvO/2dsb2JhbABDiDG6FYJlgX10gioBAQUjBFIbDgoCAg0ZAlkGLod5sHeSWoEjjT40B4ItgRMDlluRAoMmIIFs X-IronPort-AV: E=Sophos;i="4.84,855,1355115600"; d="scan'208";a="21484410" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 15 Mar 2013 22:03:39 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 14223B3F36; Fri, 15 Mar 2013 22:03:39 -0400 (EDT) Date: Fri, 15 Mar 2013 22:03:39 -0400 (EDT) From: Rick Macklem To: John Baldwin Message-ID: <88927360.3963361.1363399419023.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <201303141444.35740.jhb@freebsd.org> Subject: Re: Deadlock in the NFS client MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692) Cc: Rick Macklem , fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 16 Mar 2013 02:03:41 -0000 John Baldwin wrote: > On Thursday, March 14, 2013 1:22:39 pm Konstantin Belousov wrote: > > On Thu, Mar 14, 2013 at 10:57:13AM -0400, John Baldwin wrote: > > > On Thursday, March 14, 2013 5:27:28 am Konstantin Belousov wrote: > > > > On Wed, Mar 13, 2013 at 07:33:35PM -0400, Rick Macklem wrote: > > > > > John Baldwin wrote: > > > > > > I ran into a machine that had a deadlock among certain files > > > > > > on a > > > > > > given NFS > > > > > > mount today. I'm not sure how best to resolve it, though it > > > > > > seems like > > > > > > perhaps there is a bug with how the pool of nfsiod threads > > > > > > is managed. > > > > > > Anyway, more details on the actual hang below. 
> > > > > > This was on 8.x with the old NFS client, but I don't see
> > > > > > anything in HEAD that would fix this.
> > > > > >
> > > > > > First note that the system was idle so it had dropped down
> > > > > > to only one nfsiod thread.
> > > > > >
> > > > > Hmm, I see the problem and I'm a bit surprised it doesn't bite
> > > > > more often.
> > > > > It seems to me that this snippet of code from nfs_asyncio()
> > > > > makes too weak an assumption:
> > > > >     /*
> > > > >      * If none are free, we may already have an iod working on
> > > > >      * this mount point. If so, it will process our request.
> > > > >      */
> > > > >     if (!gotiod) {
> > > > >         if (nmp->nm_bufqiods > 0) {
> > > > >             NFS_DPF(ASYNCIO,
> > > > >                 ("nfs_asyncio: %d iods are already processing mount %p\n",
> > > > >                 nmp->nm_bufqiods, nmp));
> > > > >             gotiod = TRUE;
> > > > >         }
> > > > >     }
> > > > > It assumes that, since an nfsiod thread is processing some
> > > > > buffer for the mount, it will become available to do this one,
> > > > > which isn't true for your deadlock.
> > > > >
> > > > > I think the simple fix would be to recode nfs_asyncio() so that
> > > > > it only returns 0 if it finds an AVAILABLE nfsiod thread that it
> > > > > has assigned to do the I/O, getting rid of the above. The problem
> > > > > with doing this is that it may result in a lot more synchronous
> > > > > I/O (nfs_asyncio() returns EIO, so the caller does the I/O).
> > > > > Maybe more synchronous I/O could be avoided by allowing
> > > > > nfs_asyncio() to create a new thread even if the total is above
> > > > > nfs_iodmax. (I think this would require the fixed array to be
> > > > > replaced with a linked list and might result in a large number
> > > > > of nfsiod threads.) Maybe just having a large nfs_iodmax would
> > > > > be an adequate compromise?
> > > > >
> > > > > Does having a large # of nfsiod threads cause any serious
> > > > > problem for most systems these days?
> > > > >
> > > > > I'd be tempted to recode nfs_asyncio() as above and then,
> > > > > instead of nfs_iodmin and nfs_iodmax, I'd simply have:
> > > > > - a fixed number of nfsiod threads (this could be a tunable,
> > > > >   with the understanding that it should be large for good
> > > > >   performance)
> > > > >
> > > >
> > > > I do not see how this would solve the deadlock itself. The
> > > > proposal would only allow the system to survive slightly longer
> > > > after the deadlock appeared. And I think that allowing an
> > > > unbounded number of nfsiod threads is also fatal.
> > > >
> > > > The issue there is the LOR between the buffer lock and the vnode
> > > > lock. The buffer lock must always come after the vnode lock. The
> > > > problematic nfsiod thread, which locks the vnode, violates this
> > > > rule, because despite the LK_KERNPROC ownership of the buffer
> > > > lock, it is the thread which de facto owns the buffer (only the
> > > > thread can unlock it).
> > > >
> > > > A possible solution would be to pass LK_NOWAIT to nfs_nget()
> > > > from nfs_readdirplusrpc(). From my reading of the code,
> > > > nfs_nget() should be capable of correctly handling the lock
> > > > failure. And EBUSY would result in doit = 0, which should be
> > > > fine too.
> > > >
> > > > It is possible that EBUSY should be reset to 0, though.
> > >
> > > Yes, thinking about this more, I do think the right answer is for
> > > readdirplus to do this. The only question I have is if it should
> > > do this always, or if it should do this only from the nfsiod
> > > thread. I believe you can't get this in the non-nfsiod case.
> >
> > I agree that it looks as if the workaround is only needed for the
> > nfsiod thread.
> > On the other hand, it is not immediately obvious how to detect that
> > the current thread is the nfsiod daemon. Probably a thread flag
> > should be set.
>
> OTOH, updating the attributes from readdir+ is only an optimization
> anyway, so just having it always do LK_NOWAIT is probably ok (and
> simple). Currently I'm trying to develop a test case to provoke this
> so I can test the fix, but no luck on that yet.
>
> --
> John Baldwin

Just fyi, ignore my comment that the second version of the patch (the
one that stops the nfsiod threads from doing readdirplus) ran faster.
It was just that when I tested the 2nd patch, the server's caches were
primed. Oops.

However, so far the minimal testing I've done has been essentially
performance neutral between the unpatched and patched versions.

Hopefully John has a convenient way to do some performance testing,
since I won't be able to do much until the end of April.

rick

From owner-freebsd-fs@FreeBSD.ORG Sat Mar 16 04:01:34 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id E23D6477; Sat, 16 Mar 2013 04:01:34 +0000 (UTC) (envelope-from mckusick@mckusick.com) Received: from chez.mckusick.com (chez.mckusick.com [IPv6:2001:5a8:4:7e72:4a5b:39ff:fe12:452]) by mx1.freebsd.org (Postfix) with ESMTP id BC714E4F; Sat, 16 Mar 2013 04:01:34 +0000 (UTC) Received: from chez.mckusick.com (localhost [127.0.0.1]) by chez.mckusick.com (8.14.3/8.14.3) with ESMTP id r2G41Um7026132; Fri, 15 Mar 2013 21:01:30 -0700 (PDT) (envelope-from mckusick@chez.mckusick.com) Message-Id: <201303160401.r2G41Um7026132@chez.mckusick.com> To: Palle Girgensohn Subject: Re: leaking lots of unreferenced inodes (pg_xlog files?), maybe after moving tables and indexes to tablespace on different volume In-reply-to: <51439243.5020604@FreeBSD.org> Date: Fri, 15 Mar 2013 21:01:30 -0700 From: Kirk McKusick X-Spam-Status: No, score=0.0 required=5.0 tests=MISSING_MID,
UNPARSEABLE_RELAY autolearn=failed version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on chez.mckusick.com Cc: freebsd-fs@FreeBSD.org, Jeff Roberson X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 16 Mar 2013 04:01:34 -0000

I don't know how, but somehow something is holding references to
the removed files, causing them to fail to be reclaimed. Could you
run your system for a while to build up a new set of these files,
then run a script with the `df -ih' as before? Then run `vmstat -m',
`sysctl debug', and `fstat -f /usr' both before and after doing the
umount/mount. Hopefully that will give us some more clues as to
what is happening.

And Jeff, if you have any ideas do speak up :-)

Kirk McKusick

From owner-freebsd-fs@FreeBSD.ORG Sat Mar 16 19:48:38 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 3E8BA78D for ; Sat, 16 Mar 2013 19:48:38 +0000 (UTC) (envelope-from marck@rinet.ru) Received: from woozle.rinet.ru (woozle.rinet.ru [195.54.192.68]) by mx1.freebsd.org (Postfix) with ESMTP id CC5D1FB9 for ; Sat, 16 Mar 2013 19:48:37 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by woozle.rinet.ru (8.14.5/8.14.5) with ESMTP id r2GJmQd4009240 for ; Sat, 16 Mar 2013 23:48:26 +0400 (MSK) (envelope-from marck@rinet.ru) Date: Sat, 16 Mar 2013 23:48:26 +0400 (MSK) From: Dmitry Morozovsky To: freebsd-fs@FreeBSD.org Subject: HA iSCSI target on ZFS: model Message-ID: User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) X-NCC-RegID: ru.rinet X-OpenPGP-Key-ID: 6B691B03 MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7 (woozle.rinet.ru [0.0.0.0]); Sat, 16 Mar 2013 23:48:26 +0400 (MSK) X-BeenThere:
freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 16 Mar 2013 19:48:38 -0000

Dear colleagues,

I'm currently planning to architect and deploy a test HA iSCSI target
using two FreeBSD hosts, and would like to hear your comments.

What I already did:

- two hosts, booted from internal USB stick, with 4 HDDs and one SSD each

- LACP-based laggs with links connected to different halves of a
  clustered switch, with an mtu of 9000

- 2 carps (thanks to araujo@'s help, -current, i.e. interface property,
  not clone) holding shared addresses

- disk layout such as

root@cthulhu4:/usr/local/etc/istgt# gpart show -l
=>        34  1953525101  ada0  GPT  (931G)
          34        2014        - free -  (1M)
        2048  1952448512     1  ct4-0  (931G)
  1952450560     1074575        - free -  (524M)

=>        34  1953522988  ada1  GPT  (931G)
          34        2014        - free -  (1M)
        2048  1952448512     1  ct4-1  (931G)
  1952450560     1072462        - free -  (523M)

=>        34  1953525101  ada2  GPT  (931G)
          34        2014        - free -  (1M)
        2048  1952448512     1  ct4-2  (931G)
  1952450560     1074575        - free -  (524M)

=>        34  1953525101  ada3  GPT  (931G)
          34        2014        - free -  (1M)
        2048  1952448512     1  ct4-3  (931G)
  1952450560     1074575        - free -  (524M)

=>       34  234441581  ada4  GPT  (111G)
         34       2014        - free -  (1M)
       2048    2097152     1  ct3-zil4  (1.0G)
    2099200    2097152     2  ct4-zil4  (1.0G)
    4196352  230244352     3  ct4-cache  (109G)
  234440704        911        - free -  (455k)

=>      0  4005886  da0  BSD  (1.9G)
        0       16       - free -  (8.0k)
       16  4005870    1  (null)  (1.9G)

(da0 is USB, ada0-ada3 HDDs, ada4 SSD)

- 2 hast sets such as

root@cthulhu4:/usr/local/etc/istgt# hastctl status
Name    Status    Role       Components
d0      complete  secondary  /dev/ada0p1  cthulhu3
d1      complete  secondary  /dev/ada1p1  cthulhu3
d2      complete  primary    /dev/ada2p1  cthulhu3
d3      complete  primary    /dev/ada3p1  cthulhu3
zil3    complete  secondary  /dev/ada4p1  cthulhu3
zil4    complete  primary    /dev/ada4p2  cthulhu3

- 2 ZFS setups like

root@cthulhu4:/usr/local/etc/istgt# zpool status
  pool: ct4
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Thu Mar 14 17:58:48 2013
config:

        NAME             STATE     READ WRITE CKSUM
        ct4              ONLINE       0     0     0
          hast/d2        ONLINE       0     0     0
          hast/d3        ONLINE       0     0     0
        logs
          hast/zil4      ONLINE       0     0     0
        cache
          gpt/ct4-cache  ONLINE       0     0     0

Now, it's time to create exportable entities. I am thinking of creating
thin (non-preallocated) ZFS volumes in the pools, and sharing them via
istgt.

What zfs properties would be appropriate for this? I'm thinking at least
about volblocksize=4k (main usage will be vSphere), but I'm not sure
about it. And, more importantly, what about the sync property?

Did I miss something obvious?

(and yes, supporting scripts for checking that the paired resource is
alive are all still to be written...)

Thanks in advance!

--
Sincerely,
D.Marck                                     [DM5020, MCK-RIPE, DM3-RIPN]
[ FreeBSD committer:                                 marck@FreeBSD.org ]
------------------------------------------------------------------------
*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- marck@rinet.ru ***
------------------------------------------------------------------------