From owner-freebsd-fs@FreeBSD.ORG Sun Aug 21 04:13:08 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id DFD3A1065670; Sun, 21 Aug 2011 04:13:08 +0000 (UTC) (envelope-from jwd@SlowBlink.Com) Received: from nmail.slowblink.com (rrcs-24-199-145-34.midsouth.biz.rr.com [24.199.145.34]) by mx1.freebsd.org (Postfix) with ESMTP id 89FFA8FC08; Sun, 21 Aug 2011 04:13:08 +0000 (UTC) Received: from nmail.slowblink.com (localhost [127.0.0.1]) by nmail.slowblink.com (8.14.3/8.14.3) with ESMTP id p7L3bomJ039848; Sat, 20 Aug 2011 23:37:50 -0400 (EDT) (envelope-from jwd@nmail.slowblink.com) Received: (from jwd@localhost) by nmail.slowblink.com (8.14.3/8.14.3/Submit) id p7L3bond039847; Sat, 20 Aug 2011 23:37:50 -0400 (EDT) (envelope-from jwd) Date: Sat, 20 Aug 2011 23:37:50 -0400 From: John To: Current List Message-ID: <20110821033750.GA39626@slowblink.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.3i Cc: FS List Subject: nfs lock failure/hang when using alias address for server from linux X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 21 Aug 2011 04:13:09 -0000

Hi,

   I have an nfs server running 9-current. Everything works as far as nfs i/o operations are concerned.

   From another FreeBSD box, nfs locking works great to the server when addressed by both its real ip address and its aliased ip address.

   From a Linux system:

Linux bb05d6403.unx.sas.com 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

nfs locking works fine if the mount goes to the real ip address of the server. If, however, the server is mounted by using its aliased ip address, while nfs i/o operations work fine, file locking hangs.

   On the server, the processes:

root 5995 0.0 0.0 14272 1920 ?? Ss 3:48PM 0:05.33 /usr/sbin/rpcbind -h 10.24.6.38 -h 172.1.1.2 -h 10.24.6.33 -h 10.24.6.34
root 6021 0.0 0.0 12316 2364 ?? Ss 3:48PM 0:00.65 /usr/sbin/mountd -r -l -h 10.24.6.38 -h 172.1.1.2 -h 10.24.6.33 -h 10.24.6.34
root 6048 0.0 0.0 10060 1864 ?? Ss 3:48PM 0:00.10 nfsd: master (nfsd)
root 6049 0.0 0.0 10060 1368 ?? S 3:48PM 0:00.20 nfsd: server (nfsd)
root 6074 0.0 0.0 274432 2084 ?? Is 3:48PM 0:00.03 /usr/sbin/rpc.statd -d -h 10.24.6.38 -h 172.1.1.2 -h 10.24.6.33 -h 10.24.6.34
root 6099 0.0 0.0 14400 1780 ?? Ss 3:48PM 0:00.03 /usr/sbin/rpc.lockd -d 9 -h 10.24.6.38 -h 172.1.1.2 -h 10.24.6.33 -h 10.24.6.34

   The server is accessed by udp in addition to tcp, thus the -h options for each address. NFSv4 is not enabled at this time. I have the debug output of statd & lockd going to /var/log via syslog, but nothing useful shows up.

   The interface configuration:

bce0: flags=8843 metric 0 mtu 1500
	options=c01bb
	ether 84:2b:2b:fd:a1:fc
	inet 10.24.6.38 netmask 0xffff0000 broadcast 10.24.255.255
	inet6 fe80::862b:2bff:fefd:a1fc%bce0 prefixlen 64 scopeid 0x1
	inet 10.24.6.33 netmask 0xffffffff broadcast 10.24.255.255
	inet 10.24.6.34 netmask 0xffffffff broadcast 10.24.255.255
	nd6 options=29
	media: Ethernet autoselect (1000baseT )
	status: active

   Above, a mount to 10.24.6.38 works. A mount to either 10.24.6.33 or 10.24.6.34 works for nfs i/o operations, but hangs for lock requests.

   I'd like this to work so I can transition some volumes around to different servers.
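   In case it helps, this is the kind of check I plan to run next -- just a sketch, assuming the stock rpcinfo(8) and tcpdump(8); the Linux client address below is a placeholder:

# is nlockmgr (program 100021) registered on the alias as well as the real address?
rpcinfo -p 10.24.6.38
rpcinfo -p 10.24.6.33

# capture the NLM/NSM traffic between the client and both server addresses,
# then look at the capture in wireshark, which decodes the Sun RPC locking calls
tcpdump -i bce0 -s 0 -w /tmp/nlm.pcap host <linux-client> and \( host 10.24.6.33 or host 10.24.6.38 \)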
Does anyone have any thoughts on the best way to debug this? I've looked at what I believe are the obvious areas. I'll probably start looking more closely at tcpdump next.

Thanks,
John

From owner-freebsd-fs@FreeBSD.ORG Sun Aug 21 12:06:23 2011 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E6831106566B; Sun, 21 Aug 2011 12:06:23 +0000 (UTC) (envelope-from marck@rinet.ru) Received: from woozle.rinet.ru (woozle.rinet.ru [195.54.192.68]) by mx1.freebsd.org (Postfix) with ESMTP id 5AF538FC08; Sun, 21 Aug 2011 12:06:22 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by woozle.rinet.ru (8.14.4/8.14.4) with ESMTP id p7LBu3nv068699; Sun, 21 Aug 2011 15:56:03 +0400 (MSD) (envelope-from marck@rinet.ru) Date: Sun, 21 Aug 2011 15:56:03 +0400 (MSD) From: Dmitry Morozovsky To: freebsd-fs@FreeBSD.org Message-ID: User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) X-NCC-RegID: ru.rinet X-OpenPGP-Key-ID: 6B691B03 MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7 (woozle.rinet.ru [0.0.0.0]); Sun, 21 Aug 2011 15:56:03 +0400 (MSD) Cc: Pawel Jakub Dawidek , mm@FreeBSD.org Subject: strange ZFS v28 states after disk upgrades/rebuilds X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 21 Aug 2011 12:06:24 -0000

Dear colleagues,

I'm not sure how I got into this state, but let me try to explain:

my home file server was fresh 8-stable/amd64, booted from CF, and ZFS-root with 5x1.5T raidz + ssd as cache. raidz was built on raw disks ad4..ad12. ZFS v28

I'm starting to upgrade the disks to Hitachi 3T, now with GPT on them. The first change (ad12) worked seamlessly. The next two did not: some hangs, some reboots, and sometimes the pool could not be imported on boot (the latter disappeared each time after booting single-user into the CF /bootdisk, mount -u -w /, zpool import, reboot).

After the last reboot from single user I got (BTW, resilvering seems to survive a reboot, but can't report a proper resilvering speed)

-- 8< --
root@hamster:~# zpool status
  pool: hm
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Sun Aug 21 13:12:38 2011
    22.6G scanned out of 1.02T at 18/s, (scan is slow, no estimated time)
    267M resilvered, 2.16% done
config:

	NAME                                            STATE     READ WRITE CKSUM
	hm                                              DEGRADED     0     0     0
	  raidz1-0                                      DEGRADED     0     0     0
	    ad4                                         ONLINE       0     0     0
	    ad6                                         ONLINE       0     0     0
	    replacing-2                                 DEGRADED     0     0     0
	      13001111841528871597                      UNAVAIL      0     0     0  was /dev/ad8
	      gptid/3962b8a3-cb6d-11e0-a2b4-0007e90d0cbb  ONLINE     0     0     0  (resilvering)
	    replacing-3                                 DEGRADED     0     0     0
	      4143382663317400064                       UNAVAIL      0     0     0  was /dev/ad10
	      6273508279307911610                       UNAVAIL      0     0     0  was /dev/ad10
	      13164605370838846626                      UNAVAIL      0     0     0  was /dev/ad10
	      gptid/fabf95d4-cb4a-11e0-bdbd-0007e90d0cbb  ONLINE     0     0     0  (resilvering)
	  gptid/9faf12fa-ca5b-11e0-b59d-0007e90d0cbb    ONLINE       0     0     0
	cache
	  ad14h                                         ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list
-- 8< --

after a couple of hours

-- 8< --
root@hamster:~# zpool status -v
  pool: hm
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Sun Aug 21 13:12:38 2011
    697G scanned out of 1.02T at 569/s, (scan is slow, no estimated time)
    1.50G resilvered, 66.59% done
config:

	NAME                                            STATE     READ WRITE CKSUM
	hm                                              DEGRADED     0     0     0
	  raidz1-0                                      DEGRADED     0     0     0
	    ad4                                         ONLINE       0     0     0
	    ad6                                         ONLINE       0     0     0
	    replacing-2                                 DEGRADED     0     0     0
	      13001111841528871597                      UNAVAIL      0     0     0  was /dev/ad8
	      gptid/3962b8a3-cb6d-11e0-a2b4-0007e90d0cbb  ONLINE     0     0     0  (resilvering)
	    replacing-3                                 DEGRADED     0     0     0
	      4143382663317400064                       UNAVAIL      0     0     0  was /dev/ad10
	      6273508279307911610                       UNAVAIL      0     0     0  was /dev/ad10
	      13164605370838846626                      UNAVAIL      0     0     0  was /dev/ad10
	      gptid/fabf95d4-cb4a-11e0-bdbd-0007e90d0cbb  ONLINE     0     0     0  (resilvering)
	  gptid/9faf12fa-ca5b-11e0-b59d-0007e90d0cbb    ONLINE       0     0     0
	cache
	  ad14h                                         ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /FreeBSD/ports.full/deskutils/horde-nag
-- 8< --

then disk activity stops, and the zpool is locked:

-- 8< --
root@hamster:~# zpool status -v
load: 0.00 cmd: zpool 6300 [spa_namespace_lock] 6.02r 0.00u 0.00s 0% 2032k
-- 8< --

I have a debugging kernel, and will be glad to produce more info to help revive my pool, and hopefully avoid such sad situations in the future.

Thanks in advance!

--
Sincerely, D.Marck [DM5020, MCK-RIPE, DM3-RIPN]
[ FreeBSD committer: marck@FreeBSD.org ]
------------------------------------------------------------------------
*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- marck@rinet.ru ***
------------------------------------------------------------------------

From owner-freebsd-fs@FreeBSD.ORG Sun Aug 21 12:50:06 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 253EC1065672; Sun, 21 Aug 2011 12:50:06 +0000 (UTC) (envelope-from marck@rinet.ru) Received: from woozle.rinet.ru (woozle.rinet.ru [195.54.192.68]) by mx1.freebsd.org (Postfix) with ESMTP id AC5728FC08; Sun, 21 Aug 2011 12:50:05 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by woozle.rinet.ru (8.14.4/8.14.4) with ESMTP id p7LCo43l069183; Sun, 21 Aug 2011 16:50:04 +0400 (MSD) (envelope-from marck@rinet.ru) Date: Sun, 21 Aug 2011 16:50:04 +0400 (MSD) From: Dmitry Morozovsky To: freebsd-fs@freebsd.org In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) X-NCC-RegID: ru.rinet X-OpenPGP-Key-ID: 6B691B03 MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7 (woozle.rinet.ru [0.0.0.0]); Sun, 21 Aug 2011 16:50:04 +0400 (MSD) Cc: Pawel Jakub Dawidek Subject: Re: strange ZFS v28 states after disk upgrades/rebuilds X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 21 Aug 2011 12:50:06 -0000

On Sun, 21 Aug 2011, Dmitry Morozovsky wrote:

> Dear colleagues,
>
> I'm not sure how I got into this state, but let me try to explain:
>
> my home file server was fresh 8-stable/amd64, booted from CF, and ZFS-root with
> 5x1.5T raidz + ssd as cache. raidz was built on raw disks ad4..ad12. ZFS v28
>
> I'm starting to upgrade the disks to Hitachi 3T, now with GPT on them. The first change
> (ad12) worked seamlessly.
> The next two did not: some hangs, some reboots, and sometimes the pool could not be
> imported on boot (the latter disappeared each time after booting
> single-user into the CF /bootdisk, mount -u -w /, zpool import, reboot).
>
> After the last reboot from single user I got (BTW, resilvering seems to survive
> a reboot, but can't report a proper resilvering speed)

[snip]

>
> then disk activity stops, and the zpool is locked:
>
> -- 8< --
> root@hamster:~# zpool status -v
> load: 0.00 cmd: zpool 6300 [spa_namespace_lock] 6.02r 0.00u 0.00s 0% 2032k
> -- 8< --

after hard reset, boot proceeds normally, but now I have a similarly strange config:

marck@hamster:~> zpool status
  pool: hm
 state: DEGRADED
 scan: resilvered 1.80G in 2h26m with 0 errors on Sun Aug 21 15:39:15 2011
config:

	NAME                                            STATE     READ WRITE CKSUM
	hm                                              DEGRADED     0     0     0
	  raidz1-0                                      DEGRADED     0     0     0
	    ad4                                         ONLINE       0     0     0
	    ad6                                         ONLINE       0     0     0
	    replacing-2                                 DEGRADED     0     0     0
	      13001111841528871597                      UNAVAIL      0     0     0  was /dev/ad8
	      gptid/3962b8a3-cb6d-11e0-a2b4-0007e90d0cbb  ONLINE     0     0     0
	    replacing-3                                 DEGRADED     0     0     0
	      4143382663317400064                       UNAVAIL      0     0     0  was /dev/ad10
	      6273508279307911610                       UNAVAIL      0     0     0  was /dev/ad10
	      13164605370838846626                      UNAVAIL      0     0     0  was /dev/ad10
	      gptid/fabf95d4-cb4a-11e0-bdbd-0007e90d0cbb  ONLINE     0     0     0
	  gptid/9faf12fa-ca5b-11e0-b59d-0007e90d0cbb    ONLINE       0     0     0
	cache
	  ad14h                                         ONLINE       0     0     0

errors: No known data errors

Relevant part of zpool history:

2011-08-16.22:13:58 zpool set autoexpand=on hm
2011-08-17.01:20:14 zpool scrub hm
2011-08-17.11:20:53 zpool clear hm
2011-08-19.16:07:16 zpool replace hm ad12 /dev/gpt/hm4
2011-08-20.00:35:51 zpool replace hm ad4 gpt/hm0
2011-08-20.03:38:47 zpool clear hm
2011-08-20.17:29:31 zpool import hm
2011-08-20.17:35:12 zpool offline hm ad10
2011-08-20.17:35:23 zpool online hm ad10
2011-08-20.17:41:15 zpool scrub -s hm
2011-08-20.17:42:40 zpool offline hm 6273508279307911610
2011-08-20.17:46:40 zpool scrub -s hm
2011-08-20.17:46:59 zpool replace hm ad10 /dev/gpt/hm3
2011-08-20.20:25:51 zpool offline hm ad10
2011-08-20.20:26:44 zpool online hm ad10
2011-08-20.20:26:57 zpool online hm 6273508279307911610
2011-08-20.20:27:42 zpool online hm gpt/hm3
2011-08-20.20:28:02 zpool offline hm gpt/hm3
2011-08-20.20:28:25 zpool online hm gpt/hm3
2011-08-20.20:30:27 zpool scrub -s hm
2011-08-20.20:34:19 zpool export hm
2011-08-20.20:34:51 zpool import hm
2011-08-20.20:40:02 zpool replace hm /dev/ad10 /dev/gpt/hm3
2011-08-20.23:47:14 zpool import hm
2011-08-21.02:02:03 zpool replace hm /dev/ad8 /dev/gpt/hm2
2011-08-21.13:12:38 zpool import hm

--
Sincerely, D.Marck [DM5020, MCK-RIPE, DM3-RIPN]
[ FreeBSD committer: marck@FreeBSD.org ]
------------------------------------------------------------------------
*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- marck@rinet.ru ***
------------------------------------------------------------------------

From owner-freebsd-fs@FreeBSD.ORG Sun Aug 21 13:54:39 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E3ECC106564A; Sun, 21 Aug 2011 13:54:38 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id 8A89F8FC15; Sun, 21 Aug 2011 13:54:38 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ap8EAO4NUU6DaFvO/2dsb2JhbABBhEukPYFAAQEBAQMBAQEgKyALGxgCAg0ZAikBCSYGCAcEARwEh1SlAJA3gSyEDIEQBJEGgg6REg X-IronPort-AV: E=Sophos;i="4.68,258,1312171200"; d="scan'208";a="135068216" Received: from
erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-jnhn-pri.mail.uoguelph.ca with ESMTP; 21 Aug 2011 09:54:37 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 68F47B3F3E; Sun, 21 Aug 2011 09:54:37 -0400 (EDT) Date: Sun, 21 Aug 2011 09:54:37 -0400 (EDT) From: Rick Macklem To: John Message-ID: <1679158213.124383.1313934877377.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <20110821033750.GA39626@slowblink.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.202] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692) Cc: FS List , Current List Subject: Re: nfs lock failure/hang when using alias address for server from linux X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 21 Aug 2011 13:54:39 -0000

John De wrote:
> Hi,
>
> I have an nfs server running 9-current. Everything works as far
> as nfs i/o operations are concerned.
>
> From another FreeBSD box, nfs locking works great to the server
> when addressed by both its real ip address and its aliased ip
> address.
>
> From a Linux system:
>
> Linux bb05d6403.unx.sas.com 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May
> 10 15:42:40 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
>
> nfs locking works fine if the mount goes to the real ip address
> of the server. If, however, the server is mounted by using its
> aliased
> ip address, while nfs i/o operations work fine, file locking hangs.
>
> On the server, the processes:
>
> root 5995 0.0 0.0 14272 1920 ?? Ss 3:48PM 0:05.33 /usr/sbin/rpcbind -h
> 10.24.6.38 -h 172.1.1.2 -h 10.24.6.33 -h 10.24.6.34
> root 6021 0.0 0.0 12316 2364 ?? Ss 3:48PM 0:00.65 /usr/sbin/mountd -r
> -l -h 10.24.6.38 -h 172.1.1.2 -h 10.24.6.33 -h 10.24.6.34
> root 6048 0.0 0.0 10060 1864 ?? Ss 3:48PM 0:00.10 nfsd: master (nfsd)
> root 6049 0.0 0.0 10060 1368 ?? S 3:48PM 0:00.20 nfsd: server (nfsd)
> root 6074 0.0 0.0 274432 2084 ?? Is 3:48PM 0:00.03 /usr/sbin/rpc.statd
> -d -h 10.24.6.38 -h 172.1.1.2 -h 10.24.6.33 -h 10.24.6.34
> root 6099 0.0 0.0 14400 1780 ?? Ss 3:48PM 0:00.03 /usr/sbin/rpc.lockd
> -d 9 -h 10.24.6.38 -h 172.1.1.2 -h 10.24.6.33 -h 10.24.6.34
>
> The server is accessed by udp in addition to tcp, thus the -h
> options for each address. NFSv4 is not enabled at this time. I have
> the debug output of statd & lockd going to /var/log via syslog but
> nothing useful shows up.
>
> The interface configuration:
>
> bce0: flags=8843 metric 0 mtu
> 1500
> options=c01bb
> ether 84:2b:2b:fd:a1:fc
> inet 10.24.6.38 netmask 0xffff0000 broadcast 10.24.255.255
> inet6 fe80::862b:2bff:fefd:a1fc%bce0 prefixlen 64 scopeid 0x1
> inet 10.24.6.33 netmask 0xffffffff broadcast 10.24.255.255
> inet 10.24.6.34 netmask 0xffffffff broadcast 10.24.255.255
> nd6 options=29
> media: Ethernet autoselect (1000baseT )
> status: active
>
> Above, a mount to 10.24.6.38 works. A mount to either 10.24.6.33
> or 10.24.6.34 works for nfs i/o operations, but hangs for lock
> requests.
>
> I'd like this to work so I can transition some volumes around to
> different servers.
>
> Does anyone have any thoughts on the best way to debug this? I've
> looked
> at what I believe are the obvious areas. I'll probably start looking
> more
> closely at tcpdump next.
>
I think you will probably need to capture packets and take a look. (wireshark interprets the NFS stuff much better than tcpdump, although tcpdump is fine for the capture part)

A wild guess is that it will be something like:
- Linux client sends an IP broadcast (those Sun RPC protocols love to do that)
- FreeBSD server replies via the main address and not the alias
- Linux client doesn't handle a reply that isn't from the address used for the mount.

(You might poke around on the Linux side, in case there is some option or sysctl that affects what addresses their lockd can handle.)

rick

> Thanks,
> John
> _______________________________________________
> freebsd-current@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to
> "freebsd-current-unsubscribe@freebsd.org"

From owner-freebsd-fs@FreeBSD.ORG Sun Aug 21 14:02:47 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BD196106566B for ; Sun, 21 Aug 2011 14:02:47 +0000 (UTC) (envelope-from universite@ukr.net) Received: from otrada.od.ua (universite-1-pt.tunnel.tserv24.sto1.ipv6.he.net [IPv6:2001:470:27:140::2]) by mx1.freebsd.org (Postfix) with ESMTP id 2BEBA8FC08 for ; Sun, 21 Aug 2011 14:02:46 +0000 (UTC) Received: from [IPv6:2001:470:28:140:e81f:f334:8b7c:58c8] ([IPv6:2001:470:28:140:e81f:f334:8b7c:58c8]) (authenticated bits=0) by otrada.od.ua (8.14.4/8.14.5) with ESMTP id p7LE2gVg022556 for ; Sun, 21 Aug 2011 17:02:42 +0300 (EEST) (envelope-from universite@ukr.net) Message-ID: <4E510FEE.3010101@ukr.net> Date: Sun, 21 Aug 2011 17:02:22 +0300 From: "Vladislav V. Prodan" User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.20) Gecko/20110804 Thunderbird/3.1.12 MIME-Version: 1.0 To: freebsd-fs@freebsd.org References: In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-95.5 required=5.0 tests=FREEMAIL_FROM,FSL_RU_URL, RDNS_NONE, SPF_SOFTFAIL, T_TO_NO_BRKTS_FREEMAIL, USER_IN_WHITELIST autolearn=no version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mary-teresa.otrada.od.ua X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (otrada.od.ua [IPv6:2001:470:28:140::5]); Sun, 21 Aug 2011 17:02:45 +0300 (EEST) Subject: Re: strange ZFS v28 states after disk upgrades/rebuilds X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 21 Aug 2011 14:02:47 -0000

21.08.2011 15:50, Dmitry Morozovsky wrote:
> after hard reset, boot proceeds normally, but now I have a similarly strange
> config:

Try running:
zpool import -F

If that does not help:
zpool export
zpool import -fFX

-- Vladislav V.
Prodan VVP24-UANIC +380[67]4584408 +380[99]4060508 xmpp:vlad11@jabber.ru

From owner-freebsd-fs@FreeBSD.ORG Sun Aug 21 14:06:21 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E69DF106564A for ; Sun, 21 Aug 2011 14:06:21 +0000 (UTC) (envelope-from marck@rinet.ru) Received: from woozle.rinet.ru (woozle.rinet.ru [195.54.192.68]) by mx1.freebsd.org (Postfix) with ESMTP id 60CE98FC0C for ; Sun, 21 Aug 2011 14:06:20 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by woozle.rinet.ru (8.14.4/8.14.4) with ESMTP id p7LE6Jwl069827; Sun, 21 Aug 2011 18:06:19 +0400 (MSD) (envelope-from marck@rinet.ru) Date: Sun, 21 Aug 2011 18:06:19 +0400 (MSD) From: Dmitry Morozovsky To: "Vladislav V. Prodan" In-Reply-To: <4E510FEE.3010101@ukr.net> Message-ID: References: <4E510FEE.3010101@ukr.net> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) X-NCC-RegID: ru.rinet X-OpenPGP-Key-ID: 6B691B03 MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7 (woozle.rinet.ru [0.0.0.0]); Sun, 21 Aug 2011 18:06:19 +0400 (MSD) Cc: freebsd-fs@freebsd.org Subject: Re: strange ZFS v28 states after disk upgrades/rebuilds X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 21 Aug 2011 14:06:22 -0000

On Sun, 21 Aug 2011, Vladislav V. Prodan wrote:

> > after hard reset, boot proceeds normally, but now I have a similarly strange
> > config:
>
> Try running:
> zpool import -F
>
> If that does not help:
> zpool export
> zpool import -fFX

hm, I did not find a reference to the -F option in the zpool manual page, nor many references on the web -- could you please shed some light on it?

For now, I'm rsyncing the pool to an external eSATA disk, and then I'll be ready to experiment further.

Thanks!

--
Sincerely, D.Marck [DM5020, MCK-RIPE, DM3-RIPN]
[ FreeBSD committer: marck@FreeBSD.org ]
------------------------------------------------------------------------
*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- marck@rinet.ru ***
------------------------------------------------------------------------

From owner-freebsd-fs@FreeBSD.ORG Sun Aug 21 14:23:19 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E3561106566C for ; Sun, 21 Aug 2011 14:23:19 +0000 (UTC) (envelope-from marck@rinet.ru) Received: from woozle.rinet.ru (woozle.rinet.ru [195.54.192.68]) by mx1.freebsd.org (Postfix) with ESMTP id 628E78FC14 for ; Sun, 21 Aug 2011 14:23:18 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by woozle.rinet.ru (8.14.4/8.14.4) with ESMTP id p7LENHaZ069951; Sun, 21 Aug 2011 18:23:17 +0400 (MSD) (envelope-from marck@rinet.ru) Date: Sun, 21 Aug 2011 18:23:17 +0400 (MSD) From: Dmitry Morozovsky To: "Vladislav V.
Prodan" In-Reply-To: Message-ID: References: <4E510FEE.3010101@ukr.net> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) X-NCC-RegID: ru.rinet X-OpenPGP-Key-ID: 6B691B03 MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7 (woozle.rinet.ru [0.0.0.0]); Sun, 21 Aug 2011 18:23:18 +0400 (MSD) Cc: freebsd-fs@freebsd.org Subject: Re: strange ZFS v28 states after disk upgrades/rebuilds X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 21 Aug 2011 14:23:20 -0000 On Sun, 21 Aug 2011, Dmitry Morozovsky wrote: > On Sun, 21 Aug 2011, Vladislav V. Prodan wrote: > > > > after hard reset, boot proceeds normally, but now I have similarly strange > > > config: > > > > > > Try running: > > zpool import -F > > > > if not helps: > > zppol export > > zpool import -fFX > > hm, I did not found reference to -F option in zpool manual page, nor much > references on the web -- could you please sched me a light about it? Ah, I see from commit message: - zpool import -F. Allows to rewind corrupted pool to earlier transaction group. However, the pool does not seem to be corrupted, at least its contents; after backup is finished, I'll try to reassure by running scrub. -- Sincerely, D.Marck [DM5020, MCK-RIPE, DM3-RIPN] [ FreeBSD committer: marck@FreeBSD.org ] ------------------------------------------------------------------------ *** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- marck@rinet.ru *** ------------------------------------------------------------------------ From owner-freebsd-fs@FreeBSD.ORG Sun Aug 21 14:35:32 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6B9751065675 for ; Sun, 21 Aug 2011 14:35:32 +0000 (UTC) (envelope-from universite@ukr.net) Received: from otrada.od.ua (universite-1-pt.tunnel.tserv24.sto1.ipv6.he.net [IPv6:2001:470:27:140::2]) by mx1.freebsd.org (Postfix) with ESMTP id B25768FC18 for ; Sun, 21 Aug 2011 14:35:31 +0000 (UTC) Received: from [IPv6:2001:470:28:140:e81f:f334:8b7c:58c8] ([IPv6:2001:470:28:140:e81f:f334:8b7c:58c8]) (authenticated bits=0) by otrada.od.ua (8.14.4/8.14.5) with ESMTP id p7LEZPN0030021 for ; Sun, 21 Aug 2011 17:35:26 +0300 (EEST) (envelope-from universite@ukr.net) Message-ID: <4E511799.3020106@ukr.net> Date: Sun, 21 Aug 2011 17:35:05 +0300 From: "Vladislav V. 
Prodan" User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.20) Gecko/20110804 Thunderbird/3.1.12 MIME-Version: 1.0 CC: freebsd-fs@freebsd.org References: <4E510FEE.3010101@ukr.net> In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-94.3 required=5.0 tests=FREEMAIL_FROM,FSL_RU_URL, MISSING_HEADERS,RDNS_NONE,SPF_SOFTFAIL,T_TO_NO_BRKTS_FREEMAIL, USER_IN_WHITELIST autolearn=no version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mary-teresa.otrada.od.ua X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (otrada.od.ua [IPv6:2001:470:28:140::5]); Sun, 21 Aug 2011 17:35:30 +0300 (EEST) Subject: Re: strange ZFS v28 states after disk upgrades/rebuilds X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 21 Aug 2011 14:35:32 -0000 21.08.2011 17:23, Dmitry Morozovsky wrote: > Ah, I see from commit message: > > - zpool import -F. Allows to rewind corrupted pool to earlier > transaction group. > > However, the pool does not seem to be corrupted, at least its contents; after > backup is finished, I'll try to reassure by running scrub. > In your case, the pool in a state of "DEGRADED" and it can only be repaired by stripping transaction history. Sometimes it helps to zpool scrub , but only in the case of loose HDD. -- Vladislav V. Prodan VVP24-UANIC +380[67]4584408 +380[99]4060508 xmpp:vlad11@jabber.ru From owner-freebsd-fs@FreeBSD.ORG Sun Aug 21 20:44:42 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BC7AC106564A for ; Sun, 21 Aug 2011 20:44:42 +0000 (UTC) (envelope-from marck@rinet.ru) Received: from woozle.rinet.ru (woozle.rinet.ru [195.54.192.68]) by mx1.freebsd.org (Postfix) with ESMTP id 04F178FC12 for ; Sun, 21 Aug 2011 20:44:41 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by woozle.rinet.ru (8.14.4/8.14.4) with ESMTP id p7LKiexV073985; Mon, 22 Aug 2011 00:44:40 +0400 (MSD) (envelope-from marck@rinet.ru) Date: Mon, 22 Aug 2011 00:44:40 +0400 (MSD) From: Dmitry Morozovsky To: "Vladislav V. Prodan" In-Reply-To: <4E511799.3020106@ukr.net> Message-ID: References: <4E510FEE.3010101@ukr.net> <4E511799.3020106@ukr.net> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) X-NCC-RegID: ru.rinet X-OpenPGP-Key-ID: 6B691B03 MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7 (woozle.rinet.ru [0.0.0.0]); Mon, 22 Aug 2011 00:44:40 +0400 (MSD) Cc: freebsd-fs@freebsd.org Subject: Re: strange ZFS v28 states after disk upgrades/rebuilds X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 21 Aug 2011 20:44:42 -0000 On Sun, 21 Aug 2011, Vladislav V. Prodan wrote: > > - zpool import -F. Allows to rewind corrupted pool to earlier > > transaction group. > > > > However, the pool does not seem to be corrupted, at least its contents; > > after > > backup is finished, I'll try to reassure by running scrub. > > > > In your case, the pool in a state of "DEGRADED" and it can only be repaired by > stripping transaction history. 
> Sometimes a zpool scrub helps, but only in the case of a loose HDD.

FWIW, replacing the component with only one UNAVAIL line finished OK (the UNAVAIL line disappeared together with the "replacing" one); and for the strange set of multiple UNAVAILs I overcame this by `zpool detach'ing all UNAVAIL components but one and then `zpool replace'ing the last one.

What bothers me are actually two things:

- how this situation could arise in the first place
- what leads to the locks requiring hard resets

Unfortunately, this motherboard has a dead COM port, so not much luck with the serial console -- or can it be assigned to one of the puc(4) cards I have?

--
Sincerely, D.Marck [DM5020, MCK-RIPE, DM3-RIPN]
[ FreeBSD committer: marck@FreeBSD.org ]
------------------------------------------------------------------------
*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- marck@rinet.ru ***
------------------------------------------------------------------------

From owner-freebsd-fs@FreeBSD.ORG Sun Aug 21 22:20:17 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 076681065687; Sun, 21 Aug 2011 22:20:17 +0000 (UTC) (envelope-from delphij@gmail.com) Received: from mail-gx0-f182.google.com (mail-gx0-f182.google.com [209.85.161.182]) by mx1.freebsd.org (Postfix) with ESMTP id AB4658FC18; Sun, 21 Aug 2011 22:20:16 +0000 (UTC) Received: by gxk28 with SMTP id 28so3633359gxk.13 for ; Sun, 21 Aug 2011 15:20:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=DMQGuerv4r+Nu/LoM6pgHPD6z8BWpZ6KRFhS2RlC4/s=; b=F3VcHl0bUOO+yV/EohYfp3v11i5/Fh8zrG9bkDMd10IeIhWpBwr/712C2PEUzZ1Me3 x3fEsPqVy8FVgAreQVGKZ3PZJlG3QJ+BOY6TM9b7J5yEGIlz3L9Ste7L0VLNnqXuGqxg 2YYNRbEWakrxsC0nUpDk3PxT+av17aWtofvE8= MIME-Version: 1.0 Received: by 10.150.65.19 with SMTP id n19mr1729592yba.50.1313963794702; Sun, 21 Aug 2011 14:56:34 -0700 (PDT) Received: by 10.150.136.11 with HTTP; Sun, 21 Aug 2011 14:56:34 -0700 (PDT) In-Reply-To: References: Date: Sun, 21 Aug 2011 14:56:34 -0700 Message-ID: From: Xin LI To: Dmitry Morozovsky Content-Type: text/plain; charset=UTF-8 Cc: freebsd-fs@freebsd.org, Pawel Jakub Dawidek Subject: Re: strange ZFS v28 states after disk upgrades/rebuilds X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 21 Aug 2011 22:20:17 -0000

Hi, Dmitry,

Pawel has recently committed a fix to ZFS which is likely to be related to your problem. Could you please try commit 224791, or -HEAD newer than Aug 12 2011 07:04:16 UTC?

Cheers,
--
Xin LI https://www.delphij.net/ FreeBSD - The Power to Serve!
Live free or die

From owner-freebsd-fs@FreeBSD.ORG Mon Aug 22 02:16:21 2011 Return-Path: Delivered-To: freebsd-fs@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 97C21106566B; Mon, 22 Aug 2011 02:16:21 +0000 (UTC) (envelope-from linimon@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 701EA8FC15; Mon, 22 Aug 2011 02:16:21 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.4/8.14.4) with ESMTP id p7M2GLxD072395; Mon, 22 Aug 2011 02:16:21 GMT (envelope-from linimon@freefall.freebsd.org) Received: (from linimon@localhost) by freefall.freebsd.org (8.14.4/8.14.4/Submit) id p7M2GL0b072391; Mon, 22 Aug 2011 02:16:21 GMT (envelope-from linimon) Date: Mon, 22 Aug 2011 02:16:21 GMT Message-Id: <201108220216.p7M2GL0b072391@freefall.freebsd.org> To: linimon@FreeBSD.org, freebsd-i386@FreeBSD.org, freebsd-fs@FreeBSD.org From: linimon@FreeBSD.org Cc: Subject: Re: kern/159971: [ffs] [panic] panic with soft updates journaling during load testing X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 22 Aug 2011 02:16:21 -0000

Old Synopsis: panic with soft updates journaling during load testing
New Synopsis: [ffs] [panic] panic with soft updates journaling during load testing
Responsible-Changed-From-To: freebsd-i386->freebsd-fs
Responsible-Changed-By: linimon
Responsible-Changed-When: Mon Aug 22 02:15:14 UTC 2011
Responsible-Changed-Why:
Over to maintainer(s).

http://www.freebsd.org/cgi/query-pr.cgi?pr=159971

From owner-freebsd-fs@FreeBSD.ORG Mon Aug 22 09:03:50 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 84D7D106566B; Mon, 22 Aug 2011 09:03:50 +0000 (UTC) (envelope-from marck@rinet.ru) Received: from woozle.rinet.ru (woozle.rinet.ru [195.54.192.68]) by mx1.freebsd.org (Postfix) with ESMTP id 04B738FC12; Mon, 22 Aug 2011 09:03:49 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by woozle.rinet.ru (8.14.4/8.14.4) with ESMTP id p7M93m6m095413; Mon, 22 Aug 2011 13:03:48 +0400 (MSD) (envelope-from marck@rinet.ru) Date: Mon, 22 Aug 2011 13:03:48 +0400 (MSD) From: Dmitry Morozovsky To: Xin LI In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) X-NCC-RegID: ru.rinet X-OpenPGP-Key-ID: 6B691B03 MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7 (woozle.rinet.ru [0.0.0.0]); Mon, 22 Aug 2011 13:03:48 +0400 (MSD) Cc: freebsd-fs@freebsd.org, Pawel Jakub Dawidek Subject: Re: strange ZFS v28 states after disk upgrades/rebuilds X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 22 Aug 2011 09:03:50 -0000

Dear Xin,

On Sun, 21 Aug 2011, Xin LI wrote:

> Pawel has recently committed a fix to ZFS which is likely to be
> related to your problem. Could you please try commit 224791 or -HEAD
> newer than Aug 12 2011 07:04:16 UTC?

Yep, this at least seems to be related.
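For reference, this is roughly how I plan to pick that revision up -- a sketch only, assuming an svn(1) checkout of head in /usr/src and a GENERIC kernel config:

# update the tree to (at least) the revision with the fix
svn update -r 224791 /usr/src
# rebuild and install the kernel, then boot into it
cd /usr/src && make buildkernel KERNCONF=GENERIC && make installkernel KERNCONF=GENERIC
shutdown -r now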
However, now I'm in the middle of the last `zpool replace', and reproducing the deadlock is somewhat non-linear.

Thanks!

--
Sincerely, D.Marck [DM5020, MCK-RIPE, DM3-RIPN]
[ FreeBSD committer: marck@FreeBSD.org ]
------------------------------------------------------------------------
*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- marck@rinet.ru ***
------------------------------------------------------------------------

From owner-freebsd-fs@FreeBSD.ORG Mon Aug 22 09:09:12 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2D329106564A; Mon, 22 Aug 2011 09:09:12 +0000 (UTC) (envelope-from mm@FreeBSD.org) Received: from mail.vx.sk (mail.vx.sk [IPv6:2a01:4f8:100:1043::3]) by mx1.freebsd.org (Postfix) with ESMTP id B8D328FC14; Mon, 22 Aug 2011 09:09:11 +0000 (UTC) Received: from core.vx.sk (localhost [127.0.0.1]) by mail.vx.sk (Postfix) with ESMTP id D5B9319584B; Mon, 22 Aug 2011 11:09:10 +0200 (CEST) X-Virus-Scanned: amavisd-new at mail.vx.sk Received: from mail.vx.sk ([127.0.0.1]) by core.vx.sk (mail.vx.sk [127.0.0.1]) (amavisd-new, port 10024) with LMTP id Tv2mKp0FybfH; Mon, 22 Aug 2011 11:09:08 +0200 (CEST) Received: from [10.0.3.3] (188-167-66-148.dynamic.chello.sk [188.167.66.148]) by mail.vx.sk (Postfix) with ESMTPSA id 03DF219583D; Mon, 22 Aug 2011 11:09:06 +0200 (CEST) Message-ID: <4E521CB0.3050806@FreeBSD.org> Date: Mon, 22 Aug 2011 11:09:04 +0200 From: Martin Matuska User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20110812 Thunderbird/6.0 MIME-Version: 1.0 To: Dmitry Morozovsky References: In-Reply-To: X-Enigmail-Version: 1.3 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: freebsd-fs@freebsd.org, Pawel Jakub Dawidek Subject: Re: strange ZFS v28 states after disk upgrades/rebuilds X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 22 Aug 2011 09:09:12 -0000

I expect it may get MFCed soon, as the 1-week testing period is already over.

On 22. 8. 2011 11:03, Dmitry Morozovsky wrote:
> Dear Xin,
>
> On Sun, 21 Aug 2011, Xin LI wrote:
>
>> Pawel has recently committed a fix to ZFS which is likely to be
>> related to your problem. Could you please try commit 224791 or -HEAD
>> newer than Aug 12 2011 07:04:16 UTC?
> Yep, this at least seems to be related.
>
> However, now I'm in the middle of the last `zpool replace', and
> reproducing the deadlock is somewhat non-linear.
>
> Thanks!
> -- Martin Matuska FreeBSD committer http://blog.vx.sk From owner-freebsd-fs@FreeBSD.ORG Mon Aug 22 10:15:17 2011 Return-Path: Delivered-To: freebsd-fs@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 85551106566B; Mon, 22 Aug 2011 10:15:17 +0000 (UTC) (envelope-from mm@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 5DB778FC0A; Mon, 22 Aug 2011 10:15:17 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.4/8.14.4) with ESMTP id p7MAFHQJ048674; Mon, 22 Aug 2011 10:15:17 GMT (envelope-from mm@freefall.freebsd.org) Received: (from mm@localhost) by freefall.freebsd.org (8.14.4/8.14.4/Submit) id p7MAFHpi048670; Mon, 22 Aug 2011 10:15:17 GMT (envelope-from mm) Date: Mon, 22 Aug 2011 10:15:17 GMT Message-Id: <201108221015.p7MAFHpi048670@freefall.freebsd.org> To: mm@FreeBSD.org, mm@FreeBSD.org, freebsd-fs@FreeBSD.org From: mm@FreeBSD.org Cc: Subject: Re: kern/157728: [zfs] zfs (v28) incremental receive may leave behind temporary clones X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 22 Aug 2011 10:15:17 -0000 Synopsis: [zfs] zfs (v28) incremental receive may leave behind temporary clones State-Changed-From-To: open->closed State-Changed-By: mm State-Changed-When: Mon Aug 22 10:15:16 UTC 2011 State-Changed-Why: Resolved. Thanks! http://www.freebsd.org/cgi/query-pr.cgi?pr=157728 From owner-freebsd-fs@FreeBSD.ORG Mon Aug 22 11:02:16 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 20271106571A for ; Mon, 22 Aug 2011 11:02:16 +0000 (UTC) (envelope-from luke@digital-crocus.com) Received: from mail.digital-crocus.com (node2.digital-crocus.com [91.209.244.128]) by mx1.freebsd.org (Postfix) with ESMTP id CDCFD8FC08 for ; Mon, 22 Aug 2011 11:02:15 +0000 (UTC) DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=dkselector; d=hybrid-logic.co.uk; h=Received:Received:Subject:From:Reply-To:To:In-Reply-To:References:Content-Type:Organization:Date:Message-ID:Mime-Version:X-Mailer:Content-Transfer-Encoding:X-Spam-Score:X-Digital-Crocus-Maillimit:X-Authenticated-Sender:X-Complaints:X-Admin:X-Abuse; b=IeMj7+jmeGDbthx1M3z1l0ERikzEJRvOE0KvIsvvMZG+xtN061ZYZbExJvjYMyc44WI1kgyz7ISWP7x8KbYqjLKoRw330lHUbRDwq7SaEj6JxGfmHICsnqIdbyW65g2g; Received: from luke by mail.digital-crocus.com with local (Exim 4.69 (FreeBSD)) (envelope-from ) id 1QvSFv-00032h-HY for freebsd-fs@freebsd.org; Mon, 22 Aug 2011 12:01:23 +0100 Received: from 127cr.net ([78.105.122.99] helo=[192.168.1.23]) by mail.digital-crocus.com with esmtpa (Exim 4.69 (FreeBSD)) (envelope-from ) id 1QvSFv-00032b-A8 for freebsd-fs@freebsd.org; Mon, 22 Aug 2011 12:01:23 +0100 From: Luke Marsden To: freebsd-fs@freebsd.org In-Reply-To: <201108221015.p7MAFHpi048670@freefall.freebsd.org> References: <201108221015.p7MAFHpi048670@freefall.freebsd.org> Content-Type: text/plain; charset="UTF-8" Organization: Hybrid Web Cluster Date: Mon, 22 Aug 2011 12:02:11 +0100 Message-ID: <1314010931.3477.138.camel@pow> Mime-Version: 1.0 X-Mailer: Evolution 2.32.2 Content-Transfer-Encoding: 7bit X-Spam-Score: -1.0 X-Digital-Crocus-Maillimit: done X-Authenticated-Sender: luke X-Complaints: 
abuse@digital-crocus.com X-Admin: admin@digital-crocus.com X-Abuse: abuse@digital-crocus.com (Please include full headers in abuse reports) Subject: Re: kern/157728: [zfs] zfs (v28) incremental receive may leave behind temporary clones X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: luke@hybrid-logic.co.uk List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 22 Aug 2011 11:02:16 -0000

On Mon, 2011-08-22 at 10:15 +0000, mm@FreeBSD.org wrote:
> Synopsis: [zfs] zfs (v28) incremental receive may leave behind temporary clones
>
> State-Changed-From-To: open->closed
> State-Changed-By: mm
> State-Changed-When: Mon Aug 22 10:15:16 UTC 2011
> State-Changed-Why:
> Resolved. Thanks!

Brilliant, thanks for fixing this!

Do you have any thoughts about what might have caused the other issue I reported, the deadlock? From my email of the 15th July (mfsbsd-se-8.2-zfsv28-amd64 19.06.2011):

The biggest issue was a DEADLOCK which occurs quite reliably with a given sequence of events in short succession, on a chroot filesystem with many snapshots and a MySQL socket and nullfs mounts inside it:

1. Force unmount the nullfs mounts which are mounted on top of it
2. Close the MySQL socket in /tmp
3. Force unmount the actual filesystem (even if there are open FDs)
4. 'zfs rename' the filesystem into our 'trash' filesystem (which I understand consists of a clone, promote and destroy)

The entire ZFS subsystem then hangs on any new I/O. Here is a procstat of the zfs rename process which hangs after the force unmount:

25674 100871 zfs initial thread mi_switch+0x176 sleepq_wait+0x42 _cv_wait+0x129 txg_wait_synced+0x85 dsl_sync_task_group_wait+0x128 dsl_sync_task_do+0x54 dsl_dir_rename+0x8f dsl_dataset_rename+0x272 zfsdev_ioctl+0xe6 devfs_ioctl_f+0x7b kern_ioctl+0x102 ioctl+0xfd syscallenter+0x1e5 syscall+0x4b Xfast_syscall+0xe2

Unfortunately it's not easy to reproduce; it only seems to happen in an environment which is under load with a lot of datasets and a lot of zfs operations happening concurrently on other datasets. I spent two days trying to reproduce it in self-contained test environments but had no luck, so I'm now reporting it anyway.

--
Best Regards,
Luke Marsden
CTO, Hybrid Logic Ltd.
Web: http://www.hybrid-cluster.com/ Hybrid Web Cluster - cloud web hosting Mobile: +1-415-449-1165 (US) / +447791750420 (UK) From owner-freebsd-fs@FreeBSD.ORG Mon Aug 22 11:07:01 2011 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 59B2010656B6 for ; Mon, 22 Aug 2011 11:07:01 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 47A3B8FC14 for ; Mon, 22 Aug 2011 11:07:01 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.4/8.14.4) with ESMTP id p7MB71JQ097126 for ; Mon, 22 Aug 2011 11:07:01 GMT (envelope-from owner-bugmaster@FreeBSD.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.4/8.14.4/Submit) id p7MB70V1097124 for freebsd-fs@FreeBSD.org; Mon, 22 Aug 2011 11:07:00 GMT (envelope-from owner-bugmaster@FreeBSD.org) Date: Mon, 22 Aug 2011 11:07:00 GMT Message-Id: <201108221107.p7MB70V1097124@freefall.freebsd.org> X-Authentication-Warning: freefall.freebsd.org: gnats set sender to owner-bugmaster@FreeBSD.org using -f From: FreeBSD bugmaster To: freebsd-fs@FreeBSD.org Cc: Subject: Current problem reports assigned to freebsd-fs@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 22 Aug 2011 11:07:01 -0000 Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases. S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/159971 fs [ffs] [panic] panic with soft updates journaling durin o kern/159930 fs [ufs] [panic] kernel core o kern/159418 fs [tmpfs] [panic] tmpfs kernel panic: recursing on non r o kern/159402 fs [zfs][loader] symlinks cause I/O errors o kern/159357 fs [zfs] ZFS MAXNAMELEN macro has confusing name (off-by- o kern/159356 fs [zfs] [patch] ZFS NAME_ERR_DISKLIKE check is Solaris-s o kern/159351 fs [nfs] [patch] - divide by zero in mountnfs() o kern/159251 fs [zfs] [request]: add FLETCHER4 as DEDUP hash option o kern/159233 fs [ext2fs] [patch] fs/ext2fs: finish reallocblk implemen o kern/159232 fs [ext2fs] [patch] fs/ext2fs: merge ext2_readwrite into o kern/159077 fs [zfs] Can't cd .. with latest zfs version o kern/159048 fs [smbfs] smb mount corrupts large files o kern/159045 fs [zfs] [hang] ZFS scrub freezes system o kern/158839 fs [zfs] ZFS Bootloader Fails if there is a Dead Disk o kern/158802 fs [amd] amd(8) ICMP storm and unkillable process. 
o kern/158711 fs [ffs] [panic] panic in ffs_blkfree and ffs_valloc o kern/158231 fs [nullfs] panic on unmounting nullfs mounted over ufs o f kern/157929 fs [nfs] NFS slow read o kern/157722 fs [geli] unable to newfs a geli encrypted partition o kern/157399 fs [zfs] trouble with: mdconfig force delete && zfs strip o kern/157179 fs [zfs] zfs/dbuf.c: panic: solaris assert: arc_buf_remov o kern/156933 fs [zfs] ZFS receive after read on readonly=on filesystem o kern/156797 fs [zfs] [panic] Double panic with FreeBSD 9-CURRENT and o kern/156781 fs [zfs] zfs is losing the snapshot directory, p kern/156545 fs [ufs] mv could break UFS on SMP systems o kern/156193 fs [ufs] [hang] UFS snapshot hangs && deadlocks processes o kern/156168 fs [nfs] [panic] Kernel panic under concurrent access ove o kern/156039 fs [nullfs] [unionfs] nullfs + unionfs do not compose, re o kern/155615 fs [zfs] zfs v28 broken on sparc64 -current o kern/155587 fs [zfs] [panic] kernel panic with zfs o kern/155411 fs [regression] [8.2-release] [tmpfs]: mount: tmpfs : No o kern/155199 fs [ext2fs] ext3fs mounted as ext2fs gives I/O errors o bin/155104 fs [zfs][patch] use /dev prefix by default when importing o kern/154930 fs [zfs] cannot delete/unlink file from full volume -> EN o kern/154828 fs [msdosfs] Unable to create directories on external USB o kern/154491 fs [smbfs] smb_co_lock: recursive lock for object 1 o kern/154447 fs [zfs] [panic] Occasional panics - solaris assert somew p kern/154228 fs [md] md getting stuck in wdrain state o kern/153996 fs [zfs] zfs root mount error while kernel is not located o kern/153847 fs [nfs] [panic] Kernel panic from incorrect m_free in nf o kern/153753 fs [zfs] ZFS v15 - grammatical error when attempting to u o kern/153716 fs [zfs] zpool scrub time remaining is incorrect o kern/153695 fs [patch] [zfs] Booting from zpool created on 4k-sector o kern/153680 fs [xfs] 8.1 failing to mount XFS partitions o kern/153520 fs [zfs] Boot from GPT ZFS root on HP BL460c G1 unstable o kern/153418 fs [zfs] [panic] Kernel Panic occurred writing to zfs vol o kern/153351 fs [zfs] locking directories/files in ZFS o bin/153258 fs [patch][zfs] creating ZVOLs requires `refreservation' s kern/153173 fs [zfs] booting from a gzip-compressed dataset doesn't w o kern/153126 fs [zfs] vdev failure, zpool=peegel type=vdev.too_small p kern/152488 fs [tmpfs] [patch] mtime of file updated when only inode o kern/152022 fs [nfs] nfs service hangs with linux client [regression] o kern/151942 fs [zfs] panic during ls(1) zfs snapshot directory o kern/151905 fs [zfs] page fault under load in /sbin/zfs o kern/151845 fs [smbfs] [patch] smbfs should be upgraded to support Un o bin/151713 fs [patch] Bug in growfs(8) with respect to 32-bit overfl o kern/151648 fs [zfs] disk wait bug o kern/151629 fs [fs] [patch] Skip empty directory entries during name o kern/151330 fs [zfs] will unshare all zfs filesystem after execute a o kern/151326 fs [nfs] nfs exports fail if netgroups contain duplicate o kern/151251 fs [ufs] Can not create files on filesystem with heavy us o kern/151226 fs [zfs] can't delete zfs snapshot o kern/151111 fs [zfs] vnodes leakage during zfs unmount o kern/150503 fs [zfs] ZFS disks are UNAVAIL and corrupted after reboot o kern/150501 fs [zfs] ZFS vdev failure vdev.bad_label on amd64 o kern/150390 fs [zfs] zfs deadlock when arcmsr reports drive faulted o kern/150336 fs [nfs] mountd/nfsd became confused; refused to reload n o kern/150207 fs zpool(1): zpool import -d /dev tries to open weird dev o kern/149208 fs 
mksnap_ffs(8) hang/deadlock o kern/149173 fs [patch] [zfs] make OpenSolaris installa o kern/149015 fs [zfs] [patch] misc fixes for ZFS code to build on Glib o kern/149014 fs [zfs] [patch] declarations in ZFS libraries/utilities o kern/149013 fs [zfs] [patch] make ZFS makefiles use the libraries fro o kern/148504 fs [zfs] ZFS' zpool does not allow replacing drives to be o kern/148490 fs [zfs]: zpool attach - resilver bidirectionally, and re o kern/148368 fs [zfs] ZFS hanging forever on 8.1-PRERELEASE o bin/148296 fs [zfs] [loader] [patch] Very slow probe in /usr/src/sys o kern/148204 fs [nfs] UDP NFS causes overload o kern/148138 fs [zfs] zfs raidz pool commands freeze o kern/147903 fs [zfs] [panic] Kernel panics on faulty zfs device o kern/147881 fs [zfs] [patch] ZFS "sharenfs" doesn't allow different " o kern/147790 fs [zfs] zfs set acl(mode|inherit) fails on existing zfs o kern/147560 fs [zfs] [boot] Booting 8.1-PRERELEASE raidz system take o kern/147420 fs [ufs] [panic] ufs_dirbad, nullfs, jail panic (corrupt o kern/146941 fs [zfs] [panic] Kernel Double Fault - Happens constantly o kern/146786 fs [zfs] zpool import hangs with checksum errors o kern/146708 fs [ufs] [panic] Kernel panic in softdep_disk_write_compl o kern/146528 fs [zfs] Severe memory leak in ZFS on i386 o kern/146502 fs [nfs] FreeBSD 8 NFS Client Connection to Server s kern/145712 fs [zfs] cannot offline two drives in a raidz2 configurat o kern/145411 fs [xfs] [panic] Kernel panics shortly after mounting an o bin/145309 fs bsdlabel: Editing disk label invalidates the whole dev o kern/145272 fs [zfs] [panic] Panic during boot when accessing zfs on o kern/145246 fs [ufs] dirhash in 7.3 gratuitously frees hashes when it o kern/145238 fs [zfs] [panic] kernel panic on zpool clear tank o kern/145229 fs [zfs] Vast differences in ZFS ARC behavior between 8.0 o kern/145189 fs [nfs] nfsd performs abysmally under load o kern/144929 fs [ufs] [lor] vfs_bio.c + ufs_dirhash.c p kern/144447 fs [zfs] sharenfs fsunshare() & fsshare_main() non functi o kern/144416 fs [panic] Kernel panic on online filesystem optimization s kern/144415 fs [zfs] [panic] kernel panics on boot after zfs crash o kern/144234 fs [zfs] Cannot boot machine with recent gptzfsboot code o kern/143825 fs [nfs] [panic] Kernel panic on NFS client o bin/143572 fs [zfs] zpool(1): [patch] The verbose output from iostat o kern/143212 fs [nfs] NFSv4 client strange work ... o kern/143184 fs [zfs] [lor] zfs/bufwait LOR o kern/142878 fs [zfs] [vfs] lock order reversal o kern/142597 fs [ext2fs] ext2fs does not work on filesystems with real o kern/142489 fs [zfs] [lor] allproc/zfs LOR o kern/142466 fs Update 7.2 -> 8.0 on Raid 1 ends with screwed raid [re o kern/142306 fs [zfs] [panic] ZFS drive (from OSX Leopard) causes two o kern/142068 fs [ufs] BSD labels are got deleted spontaneously o kern/141897 fs [msdosfs] [panic] Kernel panic. 
msdofs: file name leng o kern/141463 fs [nfs] [panic] Frequent kernel panics after upgrade fro o kern/141305 fs [zfs] FreeBSD ZFS+sendfile severe performance issues ( o kern/141091 fs [patch] [nullfs] fix panics with DIAGNOSTIC enabled o kern/141086 fs [nfs] [panic] panic("nfs: bioread, not dir") on FreeBS o kern/141010 fs [zfs] "zfs scrub" fails when backed by files in UFS2 o kern/140888 fs [zfs] boot fail from zfs root while the pool resilveri o kern/140661 fs [zfs] [patch] /boot/loader fails to work on a GPT/ZFS- o kern/140640 fs [zfs] snapshot crash o kern/140068 fs [smbfs] [patch] smbfs does not allow semicolon in file o kern/139725 fs [zfs] zdb(1) dumps core on i386 when examining zpool c o kern/139715 fs [zfs] vfs.numvnodes leak on busy zfs p bin/139651 fs [nfs] mount(8): read-only remount of NFS volume does n o kern/139597 fs [patch] [tmpfs] tmpfs initializes va_gen but doesn't u o kern/139564 fs [zfs] [panic] 8.0-RC1 - Fatal trap 12 at end of shutdo o kern/139407 fs [smbfs] [panic] smb mount causes system crash if remot o kern/138662 fs [panic] ffs_blkfree: freeing free block o kern/138421 fs [ufs] [patch] remove UFS label limitations o kern/138202 fs mount_msdosfs(1) see only 2Gb o kern/136968 fs [ufs] [lor] ufs/bufwait/ufs (open) o kern/136945 fs [ufs] [lor] filedesc structure/ufs (poll) o kern/136944 fs [ffs] [lor] bufwait/snaplk (fsync) o kern/136873 fs [ntfs] Missing directories/files on NTFS volume o kern/136865 fs [nfs] [patch] NFS exports atomic and on-the-fly atomic p kern/136470 fs [nfs] Cannot mount / in read-only, over NFS o kern/135546 fs [zfs] zfs.ko module doesn't ignore zpool.cache filenam o kern/135469 fs [ufs] [panic] kernel crash on md operation in ufs_dirb o kern/135050 fs [zfs] ZFS clears/hides disk errors on reboot o kern/134491 fs [zfs] Hot spares are rather cold... 
o kern/133676 fs [smbfs] [panic] umount -f'ing a vnode-based memory dis o kern/133174 fs [msdosfs] [patch] msdosfs must support multibyte inter o kern/132960 fs [ufs] [panic] panic:ffs_blkfree: freeing free frag o kern/132397 fs reboot causes filesystem corruption (failure to sync b o kern/132331 fs [ufs] [lor] LOR ufs and syncer o kern/132237 fs [msdosfs] msdosfs has problems to read MSDOS Floppy o kern/132145 fs [panic] File System Hard Crashes o kern/131441 fs [unionfs] [nullfs] unionfs and/or nullfs not combineab o kern/131360 fs [nfs] poor scaling behavior of the NFS server under lo o kern/131342 fs [nfs] mounting/unmounting of disks causes NFS to fail o bin/131341 fs makefs: error "Bad file descriptor" on the mount poin o kern/130920 fs [msdosfs] cp(1) takes 100% CPU time while copying file o kern/130210 fs [nullfs] Error by check nullfs f kern/130133 fs [panic] [zfs] 'kmem_map too small' caused by make clea o kern/129760 fs [nfs] after 'umount -f' of a stale NFS share FreeBSD l o kern/129488 fs [smbfs] Kernel "bug" when using smbfs in smbfs_smb.c: o kern/129231 fs [ufs] [patch] New UFS mount (norandom) option - mostly o kern/129152 fs [panic] non-userfriendly panic when trying to mount(8) o kern/127787 fs [lor] [ufs] Three LORs: vfslock/devfs/vfslock, ufs/vfs f kern/127375 fs [zfs] If vm.kmem_size_max>"1073741823" then write spee o bin/127270 fs fsck_msdosfs(8) may crash if BytesPerSec is zero o kern/127029 fs [panic] mount(8): trying to mount a write protected zi f kern/126703 fs [panic] [zfs] _mtx_lock_sleep: recursed on non-recursi o kern/126287 fs [ufs] [panic] Kernel panics while mounting an UFS file o kern/125895 fs [ffs] [panic] kernel: panic: ffs_blkfree: freeing free s kern/125738 fs [zfs] [request] SHA256 acceleration in ZFS o kern/123939 fs [msdosfs] corrupts new files f sparc/123566 fs [zfs] zpool import issue: EOVERFLOW o kern/122380 fs [ffs] ffs_valloc:dup alloc (Soekris 4801/7.0/USB Flash o bin/122172 fs [fs]: amd(8) automount daemon dies on 6.3-STABLE i386, o bin/121898 fs [nullfs] pwd(1)/getcwd(2) fails with Permission denied o bin/121366 fs [zfs] [patch] Automatic disk scrubbing from periodic(8 o bin/121072 fs [smbfs] mount_smbfs(8) cannot normally convert the cha o kern/120483 fs [ntfs] [patch] NTFS filesystem locking changes o kern/120482 fs [ntfs] [patch] Sync style changes between NetBSD and F f kern/120210 fs [zfs] [panic] reboot after panic: solaris assert: arc_ o kern/118912 fs [2tb] disk sizing/geometry problem with large array o kern/118713 fs [minidump] [patch] Display media size required for a k o bin/118249 fs [ufs] mv(1): moving a directory changes its mtime o kern/118126 fs [nfs] [patch] Poor NFS server write performance o kern/118107 fs [ntfs] [panic] Kernel panic when accessing a file at N o kern/117954 fs [ufs] dirhash on very large directories blocks the mac o bin/117315 fs [smbfs] mount_smbfs(8) and related options can't mount o kern/117314 fs [ntfs] Long-filename only NTFS fs'es cause kernel pani o kern/117158 fs [zfs] zpool scrub causes panic if geli vdevs detach on o bin/116980 fs [msdosfs] [patch] mount_msdosfs(8) resets some flags f o conf/116931 fs lack of fsck_cd9660 prevents mounting iso images with o kern/116583 fs [ffs] [hang] System freezes for short time when using o bin/115361 fs [zfs] mount(8) gets into a state where it won't set/un o kern/114955 fs [cd9660] [patch] [request] support for mask,dirmask,ui o kern/114847 fs [ntfs] [patch] [request] dirmask support for NTFS ala o kern/114676 fs [ufs] snapshot creation panics: 
snapacct_ufs2: bad blo o bin/114468 fs [patch] [request] add -d option to umount(8) to detach o kern/113852 fs [smbfs] smbfs does not properly implement DFS referral o bin/113838 fs [patch] [request] mount(8): add support for relative p o bin/113049 fs [patch] [request] make quot(8) use getopt(3) and show o kern/112658 fs [smbfs] [patch] smbfs and caching problems (resolves b o kern/111843 fs [msdosfs] Long Names of files are incorrectly created o kern/111782 fs [ufs] dump(8) fails horribly for large filesystems s bin/111146 fs [2tb] fsck(8) fails on 6T filesystem o kern/109024 fs [msdosfs] [iconv] mount_msdosfs: msdosfs_iconv: Operat o kern/109010 fs [msdosfs] can't mv directory within fat32 file system o bin/107829 fs [2TB] fdisk(8): invalid boundary checking in fdisk / w o kern/106107 fs [ufs] left-over fsck_snapshot after unfinished backgro o kern/104406 fs [ufs] Processes get stuck in "ufs" state under persist o kern/104133 fs [ext2fs] EXT2FS module corrupts EXT2/3 filesystems o kern/103035 fs [ntfs] Directories in NTFS mounted disc images appear o kern/101324 fs [smbfs] smbfs sometimes not case sensitive when it's s o kern/99290 fs [ntfs] mount_ntfs ignorant of cluster sizes s bin/97498 fs [request] newfs(8) has no option to clear the first 12 o kern/97377 fs [ntfs] [patch] syntax cleanup for ntfs_ihash.c o kern/95222 fs [cd9660] File sections on ISO9660 level 3 CDs ignored o kern/94849 fs [ufs] rename on UFS filesystem is not atomic o bin/94810 fs fsck(8) incorrectly reports 'file system marked clean' o kern/94769 fs [ufs] Multiple file deletions on multi-snapshotted fil o kern/94733 fs [smbfs] smbfs may cause double unlock o kern/93942 fs [vfs] [patch] panic: ufs_dirbad: bad dir (patch from D o kern/92272 fs [ffs] [hang] Filling a filesystem while creating a sna o kern/91134 fs [smbfs] [patch] Preserve access and modification time a kern/90815 fs [smbfs] [patch] SMBFS with character conversions somet o kern/88657 fs [smbfs] windows client hang when browsing a samba shar o kern/88555 fs [panic] ffs_blkfree: freeing free frag on AMD 64 o kern/88266 fs [smbfs] smbfs does not implement UIO_NOCOPY and sendfi o bin/87966 fs [patch] newfs(8): introduce -A flag for newfs to enabl o kern/87859 fs [smbfs] System reboot while umount smbfs. o kern/86587 fs [msdosfs] rm -r /PATH fails with lots of small files o bin/85494 fs fsck_ffs: unchecked use of cg_inosused macro etc. 
o kern/80088 fs [smbfs] Incorrect file time setting on NTFS mounted vi o bin/74779 fs Background-fsck checks one filesystem twice and omits o kern/73484 fs [ntfs] Kernel panic when doing `ls` from the client si o bin/73019 fs [ufs] fsck_ufs(8) cannot alloc 607016868 bytes for ino o kern/71774 fs [ntfs] NTFS cannot "see" files on a WinXP filesystem o bin/70600 fs fsck(8) throws files away when it can't grow lost+foun o kern/68978 fs [panic] [ufs] crashes with failing hard disk, loose po o kern/65920 fs [nwfs] Mounted Netware filesystem behaves strange o kern/65901 fs [smbfs] [patch] smbfs fails fsx write/truncate-down/tr o kern/61503 fs [smbfs] mount_smbfs does not work as non-root o kern/55617 fs [smbfs] Accessing an nsmb-mounted drive via a smb expo o kern/51685 fs [hang] Unbounded inode allocation causes kernel to loc o kern/51583 fs [nullfs] [patch] allow to work with devices and socket o kern/36566 fs [smbfs] System reboot with dead smb mount and umount o kern/33464 fs [ufs] soft update inconsistencies after system crash o bin/27687 fs fsck(8) wrapper is not properly passing options to fsc o kern/18874 fs [2TB] 32bit NFS servers export wrong negative values t 245 problems total. From owner-freebsd-fs@FreeBSD.ORG Mon Aug 22 12:30:12 2011 Return-Path: Delivered-To: freebsd-fs@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1AD2E1065672 for ; Mon, 22 Aug 2011 12:30:12 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id EFD1F8FC1A for ; Mon, 22 Aug 2011 12:30:11 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.4/8.14.4) with ESMTP id p7MCUBHL076572 for ; Mon, 22 Aug 2011 12:30:11 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.4/8.14.4/Submit) id p7MCUBoN076569; Mon, 22 Aug 2011 12:30:11 GMT (envelope-from gnats) Date: Mon, 22 Aug 2011 12:30:11 GMT Message-Id: <201108221230.p7MCUBoN076569@freefall.freebsd.org> To: freebsd-fs@FreeBSD.org From: John Baldwin Cc: Subject: Re: amd64/159930: kernel core X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: John Baldwin List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 22 Aug 2011 12:30:12 -0000 The following reply was made to PR kern/159930; it has been noted by GNATS. 
From: John Baldwin To: freebsd-amd64@freebsd.org Cc: Wouter Snels , freebsd-gnats-submit@freebsd.org Subject: Re: amd64/159930: kernel core Date: Mon, 22 Aug 2011 08:27:34 -0400 On Friday, August 19, 2011 6:50:51 pm Wouter Snels wrote: > > >Number: 159930 > >Category: amd64 > >Synopsis: kernel core > >Confidential: no > >Severity: non-critical > >Priority: medium > >Responsible: freebsd-amd64 > >State: open > >Quarter: > >Keywords: > >Date-Required: > >Class: sw-bug > >Submitter-Id: current-users > >Arrival-Date: Fri Aug 19 23:00:25 UTC 2011 > >Closed-Date: > >Last-Modified: > >Originator: Wouter Snels > >Release: FreeBSD 8.2 > >Organization: > >Environment: > FreeBSD spark.ofloo.net 8.2-RELEASE-p2 FreeBSD 8.2-RELEASE-p2 #0: Wed Jul 13 15:20:57 CEST 2011 ofloo@spark.ofloo.net:/usr/obj/usr/src/sys/OFL amd64 > > >Description: > Fatal trap 12: page fault while in kernel mode > cpuid = 2; apic id = 02 > fault virtual address = 0x30 > fault code = supervisor read data, page not present > instruction pointer = 0x20:0xffffffff805dd943 > stack pointer = 0x28:0xffffff8091e3d6c0 > frame pointer = 0x28:0xffffff8091e3d6f0 > code segment = base 0x0, limit 0xfffff, type 0x1b > = DPL 0, pres 1, long 1, def32 0, gran 1 > processor eflags = interrupt enabled, resume, IOPL = 0 > current process = 18 (softdepflush) > trap number = 12 > panic: page fault > cpuid = 2 > KDB: stack backtrace: > #0 0xffffffff8063300e at kdb_backtrace+0x5e > #1 0xffffffff80602627 at panic+0x187 > #2 0xffffffff808fbbe0 at trap_fatal+0x290 > #3 0xffffffff808fbfbf at trap_pfault+0x28f > #4 0xffffffff808fc49f at trap+0x3df > #5 0xffffffff808e4644 at calltrap+0x8 > #6 0xffffffff805f668a at priv_check_cred+0x3a > #7 0xffffffff8084ebd0 at chkdq+0x310 > #8 0xffffffff8082db5d at ffs_truncate+0xfed > #9 0xffffffff8084ac5c at ufs_inactive+0x21c > #10 0xffffffff8068a761 at vinactive+0x71 > #11 0xffffffff806904b8 at vputx+0x2d8 > #12 0xffffffff80836386 at handle_workitem_remove+0x206 > #13 0xffffffff8083675e at process_worklist_item+0x20e > #14 0xffffffff80838893 at softdep_process_worklist+0xe3 > #15 0xffffffff80839d3c at softdep_flush+0x17c > #16 0xffffffff805d9f28 at fork_exit+0x118 > #17 0xffffffff808e4b0e at fork_trampoline+0xe > Uptime: 2d4h7m56s > Cannot dump. Device not defined or unavailable. > Automatic reboot in 15 seconds - press a key on the console to abort > panic: bufwrite: buffer is not busy??? Hmm, the panic seems to be caused by a null ucred pointer passed to priv_check_cred() in chkdq(): if ((flags & FORCE) == 0 && priv_check_cred(cred, PRIV_VFS_EXCEEDQUOTA, 0)) do_check = 1; else do_check = 0; However, ffs_truncate() passes in NOCRED for its credential: if ((flags & IO_EXT) && extblocks > 0) { ... #ifdef QUOTA (void) chkdq(ip, -extblocks, NOCRED, 0); #endif A few other places call chkdq() with NOCRED (but not with the FORCE flag): ffs/ffs_inode.c:522: (void) chkdq(ip, -blocksreleased, NOCRED, 0); ffs/ffs_softdep.c:6201: (void) chkdq(ip, -datablocks, NOCRED, 0); ffs/ffs_softdep.c:6431: (void) chkdq(ip, -datablocks, NOCRED, 0); Hmm, all these calls should be passing in a negative value though, and reducing usage takes a shorter path at the start of chkdq() that always returns without ever getting to the call to priv_check_cred(). Similarly if the value (e.g. extblocks) was 0. This implies that extblocks was a negative value which seems very odd. Especially given the logic in ffs_truncate(): if ((flags & IO_EXT) && extblocks > 0) { ... 
if ((error = ffs_syncvnode(vp, MNT_WAIT)) != 0) return (error); #ifdef QUOTA (void) chkdq(ip, -extblocks, NOCRED, 0); #endif Nothing changes extblocks in between that check and the call to chkdq(). It would probably be best to get a crashdump if this is reproducible so we can investigate it further. -- John Baldwin From owner-freebsd-fs@FreeBSD.ORG Mon Aug 22 15:30:11 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 130FB106564A for ; Mon, 22 Aug 2011 15:30:11 +0000 (UTC) (envelope-from tdb@carrick.bishnet.net) Received: from carrick.bishnet.net (carrick.bishnet.net [IPv6:2a01:348:132:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id CD6BC8FC08 for ; Mon, 22 Aug 2011 15:30:10 +0000 (UTC) Received: from carrick-users.bishnet.net ([2a01:348:132:51::10]) by carrick.bishnet.net with esmtps (TLSv1:AES256-SHA:256) (Exim 4.76 (FreeBSD)) (envelope-from ) id 1QvWRt-000GCo-Of for freebsd-fs@freebsd.org; Mon, 22 Aug 2011 16:30:01 +0100 Received: (from tdb@localhost) by carrick-users.bishnet.net (8.14.4/8.14.4/Submit) id p7MFTxjA062258 for freebsd-fs@freebsd.org; Mon, 22 Aug 2011 16:29:59 +0100 (BST) (envelope-from tdb) Date: Mon, 22 Aug 2011 16:29:59 +0100 From: Tim Bishop To: freebsd-fs@freebsd.org Message-ID: <20110822152959.GC9389@carrick-users.bishnet.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline X-PGP-Key: 0x5AE7D984, http://www.bishnet.net/tim/tim-bishnet-net.asc X-PGP-Fingerprint: 1453 086E 9376 1A50 ECF6 AE05 7DCE D659 5AE7 D984 User-Agent: Mutt/1.5.21 (2010-09-15) Subject: Can't boot ZFS: invalid zap_type X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 22 Aug 2011 15:30:11 -0000 I'm having trouble booting from a ZFS pool. I get the following error at boot time: ZFS: invalid zap_type=134218628 I have my root fs on ZFS, and I'm using gptzfsboot. The machine is on RELENG_8. I'd been having some performance problems so I tried rolling back RELENG_8 to the start of the month. It hasn't booted since I did that. I can boot from a live cd and I can import and read the pool. I brought the OS back to the latest RELENG_8 too, but that didn't help. Any suggestions? As an aside, and this isn't meant to be too negative, I've been having a lot of issues with gptzfsboot. It's been fine for ages, but as soon as I've started changing disks or putting the disks in another machine, I've had lots of problems (mostly the "all block copies unavailable" error). It seems gptzfsboot does its job fine but is quite easy to break if things aren't exactly how it expects. Tim. 
-- Tim Bishop http://www.bishnet.net/tim/ PGP Key: 0x5AE7D984 From owner-freebsd-fs@FreeBSD.ORG Mon Aug 22 19:15:07 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BB119106566B; Mon, 22 Aug 2011 19:15:07 +0000 (UTC) (envelope-from delphij@gmail.com) Received: from mail-yi0-f54.google.com (mail-yi0-f54.google.com [209.85.218.54]) by mx1.freebsd.org (Postfix) with ESMTP id 0FABE8FC13; Mon, 22 Aug 2011 19:15:06 +0000 (UTC) Received: by yib19 with SMTP id 19so4314816yib.13 for ; Mon, 22 Aug 2011 12:15:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; bh=1mbGOtEUNrj50n+NZ7vcQsNtaCi5qB7Mf74XupBdti0=; b=Ooq37H51YRGrGayLmvyIv56lZCiz3xabC1Vm5mLLaVmmVD/OELcWWqk3HtYt7F9Er6 YYzFsn/UOZY0TtTGkDw5lYjtGiFXjSpnmv20eSYGZUUMKm8sLkpH8pCSOiaJiBl0qGke OzmQIrYyGJGoY0uGzTocWF85nrXVy4N0zzftg= MIME-Version: 1.0 Received: by 10.151.157.11 with SMTP id j11mr2624198ybo.392.1314040506447; Mon, 22 Aug 2011 12:15:06 -0700 (PDT) Received: by 10.150.136.11 with HTTP; Mon, 22 Aug 2011 12:15:06 -0700 (PDT) In-Reply-To: <4E521CB0.3050806@FreeBSD.org> References: <4E521CB0.3050806@FreeBSD.org> Date: Mon, 22 Aug 2011 12:15:06 -0700 Message-ID: From: Xin LI To: Martin Matuska Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Cc: freebsd-fs@freebsd.org, Pawel Jakub Dawidek , Dmitry Morozovsky Subject: Re: strange ZFS v28 states after disk upgrades/rebuilds X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 22 Aug 2011 19:15:07 -0000 On Mon, Aug 22, 2011 at 2:09 AM, Martin Matuska wrote: > I suggest it may get MFCed soon, as the 1 week testing period is already > over. +1. > On 22. 8. 2011 11:03, Dmitry Morozovsky wrote: >> Dear Xin, >> >> On Sun, 21 Aug 2011, Xin LI wrote: >> >>> Pawel has recently committed a fix to ZFS which is likely to be >>> related to your problem. Could you please try commit 224791 or -HEAD >>> newer than Aug 12 2011 07:04:16 UTC? >> Yep, this at least seems to be related. >> >> However, now I'm in the middle of the last `zpool replace', and >> reproducing the deadlock is somewhat non-linear. >> >> Thanks! >> > > -- > Martin Matuska > FreeBSD committer > http://blog.vx.sk > > -- Xin LI https://www.delphij.net/ FreeBSD - The Power to Serve!
Live free or die From owner-freebsd-fs@FreeBSD.ORG Tue Aug 23 10:11:05 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8F693106566C for ; Tue, 23 Aug 2011 10:11:05 +0000 (UTC) (envelope-from freebsd-fs@m.gmane.org) Received: from lo.gmane.org (lo.gmane.org [80.91.229.12]) by mx1.freebsd.org (Postfix) with ESMTP id 1F32E8FC13 for ; Tue, 23 Aug 2011 10:11:04 +0000 (UTC) Received: from list by lo.gmane.org with local (Exim 4.69) (envelope-from ) id 1Qvnwm-0002cL-3L for freebsd-fs@freebsd.org; Tue, 23 Aug 2011 12:11:04 +0200 Received: from 208.88.188.90.adsl.tomsknet.ru ([90.188.88.208]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Tue, 23 Aug 2011 12:11:04 +0200 Received: from vadim_nuclight by 208.88.188.90.adsl.tomsknet.ru with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Tue, 23 Aug 2011 12:11:04 +0200 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-fs@freebsd.org From: Vadim Goncharov Date: Tue, 23 Aug 2011 10:10:50 +0000 (UTC) Organization: Nuclear Lightning @ Tomsk, TPU AVTF Hostel Lines: 76 Message-ID: References: <1303085986.99226.1313794735324.JavaMail.root@erie.cs.uoguelph.ca> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Complaints-To: usenet@dough.gmane.org X-Gmane-NNTP-Posting-Host: 208.88.188.90.adsl.tomsknet.ru X-Comment-To: Rick Macklem User-Agent: slrn/0.9.9p1 (FreeBSD) Subject: Re: touch(1) not working on directories in an msdosfs(5) envirement X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: vadim_nuclight@mail.ru List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 23 Aug 2011 10:11:05 -0000 Hi Rick Macklem! On Fri, 19 Aug 2011 18:58:55 -0400 (EDT); Rick Macklem wrote about 'Re: touch(1) not working on directories in an msdosfs(5) envirement': >>> Yes, FAT file systems do not maintain a directory modify >>> time. (The original FAT12,16 structure didn't even have a >>> modify time for the root dir.) >> >>> Just like Windows. >> >>> This causes issues when a FAT fs is exported via NFS and >>> someone was going to experiment with an "in memory only" >>> modify time for dirs, to minimize caching issues, but I >>> haven't heard back from them lately. >> >>> Apparently Mac OS X chooses to update the modify time that >>> exists on FAT32 file systems, but that isn't Windows compatible. >> >> What? I've just now created a test directory and changed it's modify >> time >> in Far Manager on Windows 2000, in a FAT32 partition. In fact it >> allows to >> change all three directory times, creation and access, too. So, I >> conclude, >> the FAT supports it. >> > Well, FAT32 (not the root dir of FAT12 or FAT16) does have a modify > time stored on disk for the directory entry for a directory. > The case I was thinking of (because that was what affected NFS client > caching) was the case where an entry is added to a directory. I just > checked that and it does not change the directory's modify time when > an entry is added to a directory (at least for Windows7 personal...). > I'm not enough of a Windows guy to even know what "Far Manager" is, > but I'm not surprised that there is a tool that can change it. 
That's a plugin-enabled two-panel console file manager, of which, um, the Midnight Commander is just a cheap, underpowered, buggy clone :-) But even it does not preserve directory times when directory trees are copied, yes. It just has a separate dialog to modify all of a file's attributes and times. I had to write a small Delphi program several years ago to copy directory times from tree to tree. > msdosfs_setattr() in sys/fs/msdosfs/msdosfs_vnops.c definitely only > does it for non-directories: > if (vp->v_type != VDIR) { > if ((pmp->pm_flags & MSDOSFSMNT_NOWIN95) == 0 && > vap->va_atime.tv_sec != VNOVAL) { > dep->de_flag &= ~DE_ACCESS; > timespec2fattime(&vap->va_atime, 0, > &dep->de_ADate, NULL, NULL); > } > if (vap->va_mtime.tv_sec != VNOVAL) { > dep->de_flag &= ~DE_UPDATE; > timespec2fattime(&vap->va_mtime, 0, > &dep->de_MDate, &dep->de_MTime, NULL); > } > dep->de_Attributes |= ATTR_ARCHIVE; > dep->de_flag |= DE_MODIFIED; > } > I'm not the author of the above, but I had assumed that it was > because Windows doesn't normally update it. Obviously, the above > code could easily be changed (although I haven't tested that), if > that is now considered correct behaviour. (It might have been > because the msdosfs is meant to work for all FAT variants.) I don't know about the other variants, nowhere to check :-) And Windows doesn't, yes. Windows lacks many standard tools and actions which are supported by the API, though. -- WBR, Vadim Goncharov. ICQ#166852181 mailto:vadim_nuclight@mail.ru [Anti-Greenpeace][Sober FreeBSD zealot][http://nuclight.livejournal.com] From owner-freebsd-fs@FreeBSD.ORG Tue Aug 23 10:27:30 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 38A2A106566B for ; Tue, 23 Aug 2011 10:27:30 +0000 (UTC) (envelope-from freebsd-fs@m.gmane.org) Received: from lo.gmane.org (lo.gmane.org [80.91.229.12]) by mx1.freebsd.org (Postfix) with ESMTP id BBB3D8FC08 for ; Tue, 23 Aug 2011 10:27:29 +0000 (UTC) Received: from list by lo.gmane.org with local (Exim 4.69) (envelope-from ) id 1QvoCe-0008U6-Cw for freebsd-fs@freebsd.org; Tue, 23 Aug 2011 12:27:28 +0200 Received: from 208.88.188.90.adsl.tomsknet.ru ([90.188.88.208]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Tue, 23 Aug 2011 12:27:28 +0200 Received: from vadim_nuclight by 208.88.188.90.adsl.tomsknet.ru with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Tue, 23 Aug 2011 12:27:28 +0200 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-fs@freebsd.org From: Vadim Goncharov Date: Tue, 23 Aug 2011 10:27:14 +0000 (UTC) Organization: Nuclear Lightning @ Tomsk, TPU AVTF Hostel Lines: 53 Message-ID: References: <1092971110.92110.1313782831745.JavaMail.root@erie.cs.uoguelph.ca> <20110820145112.Y872@besplex.bde.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Complaints-To: usenet@dough.gmane.org X-Gmane-NNTP-Posting-Host: 208.88.188.90.adsl.tomsknet.ru X-Comment-To: Bruce Evans User-Agent: slrn/0.9.9p1 (FreeBSD) Subject: Re: touch(1) not working on directories in an msdosfs(5) envirement X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: vadim_nuclight@mail.ru List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 23 Aug 2011 10:27:30 -0000 Hi Bruce Evans!
On Sat, 20 Aug 2011 16:44:59 +1000 (EST); Bruce Evans wrote about 'Re: touch(1) not working on directories in an msdosfs(5) envirement': > The above is only the least serious of the bugs in msdosfs_setattr() :-(. > With the above fix, plain touch works as well as possible -- it cannot > work perfectly since setting of atimes is not always supported. But > touch -r and more importantly, cp -p only work as well as possible for > root, since they use utimes() without the null timeptr arg that allows > plain touch to work. A non-null timeptr arg ends up normally requiring > root permissions for msdosfs where it normally doesn't require extra > permissions for ffs, because ownership requirements for the non-null case > cannot be satisfied by file systems that don't really support ownerships. > We fudge the ownerships and use weak checks on them in most places, but > for utimes() we use strict checks that almost always fail: from my old > version: So, now the usual case of not touching directory times on change is preserved, but cp -r et al. sets times as expected? Sounds good, could it be committed please? > % file=z > % ... > % atime=Sat Aug 20 00:00:00 2011 (1313762400.0) > % ctime=Sat Aug 20 16:14:29 2011 (1313820869.740000000) > % mtime=Sat Aug 20 16:14:28 2011 (1313820868.0) > This has the expected 2-second granularity for the mtime, but the other > times are strange: > - the atime is far in the past, and according to other tests has a > granularity of at least 200 seconds > - the ctime has a granularity of 100 msec. This differs significantly > from the mtime's granularity, so the ctime is up to 1.99 seconds in > advance of the mtime. This is probably a local bug -- I probably > don't have the fix for confusion between the ctime and the creation > time (birthtime). msdosfs only has a creation time so the ctime must > be faked and should usually be the same as the mtime. But how does > the creation time have more precision? > In other tests, creat() of a file sets the mtime and ctime reasonably, > but the atime remains with a fixed value far in the past. touch > advances the mtime correctly, but doesn't update the ctime. This is > consistent with displayed ctime actually being the creation time. That's a brainfart on the part of the FAT designers. There were not enough bytes in the directory entry, so for atime there are just 2 bytes - only the date is stored, no time at all. And for ctime extra granularity was added where it is least needed - one usually needs more granularity for atime than for a fixed ctime. BTW, that one is no longer their problem, but ours: it's actually _btime_, not ctime (Unices just had no btime ages ago). -- WBR, Vadim Goncharov. ICQ#166852181 mailto:vadim_nuclight@mail.ru [Anti-Greenpeace][Sober FreeBSD zealot][http://nuclight.livejournal.com]
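Since the thread keeps coming back to what FAT can and cannot store, a short sketch of the packing may help. The 16-bit layouts below are the standard FAT directory-entry format; the shell demonstration itself is only an illustration:

# FAT time word: hours(5 bits) | minutes(6 bits) | seconds/2(5 bits);
# storing seconds/2 is where the 2-second mtime granularity comes from.
# The access stamp gets only a 16-bit *date* word (year-1980, month, day),
# so atime carries no time of day at all.
H=16 M=14 S=28
echo $(( (H << 11) | (M << 5) | (S / 2) ))   # 33230, the packed time word

That matches the observations above: mtime can never be finer than 2 seconds, and atime never finer than a day; only the creation stamp carries anything finer.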
From owner-freebsd-fs@FreeBSD.ORG Tue Aug 23 12:30:54 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 07E85106564A for ; Tue, 23 Aug 2011 12:30:54 +0000 (UTC) (envelope-from wzrd@rambler.ru) Received: from mx-out-wr-1.rambler.ru (mx-out-wr-1.rambler.ru [81.19.92.40]) by mx1.freebsd.org (Postfix) with ESMTP id B4BDF8FC12 for ; Tue, 23 Aug 2011 12:30:53 +0000 (UTC) Received: from mcgi-wr-11.rambler.ru (mcgi-wr-11.rambler.ru [10.32.5.11]) by mx-out-wr-1.rambler.ru (Postfix) with ESMTP id 4D6F113FD852 for ; Tue, 23 Aug 2011 16:17:00 +0400 (MSD) Received: from mcgi-wr-11.rambler.ru (localhost [127.0.0.1]) by mcgi-wr-11.rambler.ru (Postfix) with ESMTP id 31F3F130082A for ; Tue, 23 Aug 2011 16:17:00 +0400 (MSD) Received: from [62.213.45.64] by mcgi-wr-11.rambler.ru with HTTP (mailimap); Tue, 23 Aug 2011 16:16:59 +0400 From: Евгений To: Date: Tue, 23 Aug 2011 16:16:59 +0400 MIME-Version: 1.0 Content-Disposition: inline Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset="us-ascii"; format="flowed" Message-Id: <17204483.1314101819.140163768.48609@mcgi-wr-11.rambler.ru> X-Mailer: Ramail 3u, (chameleon), http://mail.rambler.ru Subject: Upgrade ZFS v15 to ZFS v28 have problem X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 23 Aug 2011 12:30:54 -0000 Hello, after the transition from 8.2-RELEASE to: # uname -a FreeBSD ftp.local 8.2-STABLE FreeBSD 8.2-STABLE #0: Mon Aug 22 12:23:13 UTC 2011 and after updating ZFS v15 to ZFS v28 (with # CFLAGS+=-DDEBUG=1 and # DEBUG_FLAGS=-r commented out, i.e. off), performance dropped by a factor of 2-3. What could the reason be? Before the update the copying speed was 700-800MB; after the update it is 200-250MB :( -- Eugeny.
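A plausible first thing to rule out for a slowdown like the one above (an assumption about the cause, not a confirmed diagnosis) is that debug options are still compiled into the kernel or the zfs module, since a debug build can easily cost a factor of 2-3 in throughput. A minimal sketch, assuming the kernel was built with "options INCLUDE_CONFIG_FILE" (the 8.x GENERIC default) so its configuration is readable via sysctl:

# look for leftover debug knobs in the build configuration
grep -nE 'DEBUG' /etc/make.conf /etc/src.conf 2>/dev/null
# inspect the running kernel's config for debug options
sysctl -n kern.conftxt | grep -E 'WITNESS|INVARIANTS|DEBUG'

If anything turns up, rebuilding without it is the cheapest experiment before hunting for a real v28 regression.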
From owner-freebsd-fs@FreeBSD.ORG Tue Aug 23 13:31:31 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 122E21065676 for ; Tue, 23 Aug 2011 13:31:31 +0000 (UTC) (envelope-from pawel@dawidek.net) Received: from mail.dawidek.net (60.wheelsystems.com [83.12.187.60]) by mx1.freebsd.org (Postfix) with ESMTP id B9D7E8FC28 for ; Tue, 23 Aug 2011 13:31:30 +0000 (UTC) Received: from localhost (58.wheelsystems.com [83.12.187.58]) by mail.dawidek.net (Postfix) with ESMTPSA id C5AD3A08; Tue, 23 Aug 2011 15:12:06 +0200 (CEST) Date: Tue, 23 Aug 2011 15:11:48 +0200 From: Pawel Jakub Dawidek To: Xin LI Message-ID: <20110823131148.GD1662@garage.freebsd.pl> References: <4E521CB0.3050806@FreeBSD.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="HG+GLK89HZ1zG0kk" Content-Disposition: inline In-Reply-To: X-OS: FreeBSD 9.0-CURRENT amd64 User-Agent: Mutt/1.5.21 (2010-09-15) Cc: freebsd-fs@freebsd.org, Dmitry Morozovsky Subject: Re: strange ZFS v28 states after disk upgrades/rebuilds X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 23 Aug 2011 13:31:31 -0000 On Mon, Aug 22, 2011 at 12:15:06PM -0700, Xin LI wrote: > On Mon, Aug 22, 2011 at 2:09 AM, Martin Matuska wrote: > > I suggest it may get MFCed soon, as the 1 week testing period is already > > over. > > +1. Done. -- Pawel Jakub Dawidek http://www.wheelsystems.com FreeBSD committer http://www.FreeBSD.org Am I Evil? Yes, I Am!
http://yomoli.com From owner-freebsd-fs@FreeBSD.ORG Tue Aug 23 15:03:02 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C591E1065672 for ; Tue, 23 Aug 2011 15:03:02 +0000 (UTC) (envelope-from gerrit@pmp.uni-hannover.de) Received: from mrelay1.uni-hannover.de (mrelay1.uni-hannover.de [130.75.2.106]) by mx1.freebsd.org (Postfix) with ESMTP id 5B0F98FC0A for ; Tue, 23 Aug 2011 15:03:01 +0000 (UTC) Received: from www.pmp.uni-hannover.de (www.pmp.uni-hannover.de [130.75.117.2]) by mrelay1.uni-hannover.de (8.14.4/8.14.4) with ESMTP id p7NF2tJr017185 for ; Tue, 23 Aug 2011 17:02:57 +0200 Received: from pmp.uni-hannover.de (unknown [130.75.117.3]) by www.pmp.uni-hannover.de (Postfix) with SMTP id AF31910A for ; Tue, 23 Aug 2011 17:02:55 +0200 (CEST) Date: Tue, 23 Aug 2011 17:02:55 +0200 From: Gerrit Kühn To: freebsd-fs@freebsd.org Message-Id: <20110823170255.9d36b2cd.gerrit@pmp.uni-hannover.de> Organization: Albert-Einstein-Institut (MPI für Gravitationsphysik & IGP Universität Hannover) X-Mailer: Sylpheed 3.0.3 (GTK+ 2.22.1; amd64-portbld-freebsd8.1) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-PMX-Version: 5.5.9.395186, Antispam-Engine: 2.7.2.376379, Antispam-Data: 2011.8.23.145415 Subject: zfs snapshot: Bad file descriptor X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: gerrit.kuehn@aei.mpg.de List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 23 Aug 2011 15:03:02 -0000 Hi all, since upgrading some of my storage machines to recent 8.2-stable and zfs-v28 I see the following on some filesystems after some time of operation: --- mclane# ll /tank/home/pt/.zfs ls: snapshot: Bad file descriptor total 0 --- I make quite heavy use of snapshots on all my machines and use rsync to back up snapshots to other machines. Googling around I found several people reporting similar problems, but no real solution (apart from rebooting, which is not really a thing you want to do every time you run into this). Is there any knowledge/ideas available on the list here about how to improve this situation? Am I just one of the few unlucky people who see this, or is there an actual reason for this happening that could be fixed or circumvented?
cu Gerrit From owner-freebsd-fs@FreeBSD.ORG Tue Aug 23 16:19:59 2011 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 50EFD1065672 for ; Tue, 23 Aug 2011 16:19:59 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail01.syd.optusnet.com.au (mail01.syd.optusnet.com.au [211.29.132.182]) by mx1.freebsd.org (Postfix) with ESMTP id DDF568FC0A for ; Tue, 23 Aug 2011 16:19:58 +0000 (UTC) Received: from c122-106-165-191.carlnfd1.nsw.optusnet.com.au (c122-106-165-191.carlnfd1.nsw.optusnet.com.au [122.106.165.191]) by mail01.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id p7NGJubJ011409 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 24 Aug 2011 02:19:57 +1000 Date: Wed, 24 Aug 2011 02:19:56 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Vadim Goncharov In-Reply-To: Message-ID: <20110824015751.I2167@besplex.bde.org> References: <1092971110.92110.1313782831745.JavaMail.root@erie.cs.uoguelph.ca> <20110820145112.Y872@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@FreeBSD.org Subject: Re: touch(1) not working on directories in an msdosfs(5) envirement X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 23 Aug 2011 16:19:59 -0000 On Tue, 23 Aug 2011, Vadim Goncharov wrote: > Hi Bruce Evans! > > On Sat, 20 Aug 2011 16:44:59 +1000 (EST); Bruce Evans wrote about 'Re: touch(1) not working on directories in an msdosfs(5) envirement': > >> The above is only the least serious of the bugs in msdosfs_setattr() :-(. >> With the above fix, plain touch works as well as possible -- it cannot >> work perfectly since setting of atimes is not always supported. But >> touch -r and more importantly, cp -p only work as well as possible for >> root, since they use utimes() without the null timeptr arg that allows >> plain touch to work. A non-null timeptr arg ends up normally requiring >> root permissions for msdosfs where it normally doesn't require extra >> permissions for ffs, because ownership requirements for the non-null case >> cannot be satisfied by file systems that don't really support ownerships. >> We fudge the ownerships and use weak checks on them in most places, but >> for utimes() we use strict checks that almost always fail: from my old >> version: > > So, now the usual case of not touching directory times on change is preserved, > but cp -r et al. sets times as expected? Sounds good, could it be committed > please? Yes, cp -p works but only if the user is root or the owner of the file. Someone else will have to commit it. >> % file=z >> % ... >> % atime=Sat Aug 20 00:00:00 2011 (1313762400.0) >> % ctime=Sat Aug 20 16:14:29 2011 (1313820869.740000000) >> % mtime=Sat Aug 20 16:14:28 2011 (1313820868.0) > >> This has the expected 2-second granularity for the mtime, but the other >> times are strange: >> - the atime is far in the past, and according to other tests has a >> granularity of at least 200 seconds >> - the ctime has a granularity of 100 msec. This differs significantly >> from the mtime's granularity, so the ctime is up to 1.99 seconds in >> advance of the mtime. This is probably a local bug -- I probably >> don't have the fix for confusion between the ctime and the creation >> time (birthtime). 
msdosfs only has a creation time so the ctime must >> be faked and should usually be the same as the mtime. But how does >> the creation time have more precision? >> In other tests, creat() of a file sets the mtime and ctime reasonably, >> but the atime remains with a fixed value far in the past. touch >> advances the mtime correctly, but doesn't update the ctime. This is >> consistent with displayed ctime actually being the creation time. > That's a brainfart on the part of the FAT designers. There were not enough bytes > in the directory entry, so for atime there are just 2 bytes - only the date is stored, > no time at all. Ah, the missing seconds part of the time is now clear in the FreeBSD sources too -- a null pointer is used. Now I wonder if this field is worth supporting at all and whether the current support of it is the best use of it (I guess there is nothing better than just writing the date, and compatibility requires this). I normally mount with -noatime but am not so careful about this for msdosfs. FreeBSD could default to -noatime for msdosfs or any file system where atimes are even less useful than usual. OTOH, since the atimes only change every day, the -noatime option as a means to prevent excessive writes is not very useful. msdosfs already has the optimization of not writing to disk for null changes to atimes. I wondered why it didn't do this optimization for all null changes (my version does). Now I know. > And for ctime extra granularity was added where it is least needed - one usually needs > more granularity for atime than for a fixed ctime. > BTW, that one is no longer their problem, but ours: it's actually _btime_, > not ctime (Unices just had no btime ages ago). Bruce From owner-freebsd-fs@FreeBSD.ORG Tue Aug 23 20:21:26 2011 Return-Path: Delivered-To: freebsd-fs@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 272481065670; Tue, 23 Aug 2011 20:21:26 +0000 (UTC) (envelope-from mm@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id F3C3F8FC14; Tue, 23 Aug 2011 20:21:25 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.4/8.14.4) with ESMTP id p7NKLPpt026211; Tue, 23 Aug 2011 20:21:25 GMT (envelope-from mm@freefall.freebsd.org) Received: (from mm@localhost) by freefall.freebsd.org (8.14.4/8.14.4/Submit) id p7NKLPa3026207; Tue, 23 Aug 2011 20:21:25 GMT (envelope-from mm) Date: Tue, 23 Aug 2011 20:21:25 GMT Message-Id: <201108232021.p7NKLPa3026207@freefall.freebsd.org> To: mm@FreeBSD.org, freebsd-bugs@FreeBSD.org, freebsd-fs@FreeBSD.org From: mm@FreeBSD.org Cc: Subject: Re: kern/160035: [zfs] zfs rollback does not invalidate mmapped cache X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 23 Aug 2011 20:21:26 -0000 Synopsis: [zfs] zfs rollback does not invalidate mmapped cache Responsible-Changed-From-To: freebsd-bugs->freebsd-fs Responsible-Changed-By: mm Responsible-Changed-When: Tue Aug 23 20:20:45 UTC 2011 Responsible-Changed-Why: Assign to freebsd-fs@FreeBSD.org http://www.freebsd.org/cgi/query-pr.cgi?pr=160035 From owner-freebsd-fs@FreeBSD.ORG Tue Aug 23 20:23:17 2011 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix)
with ESMTP id 5063A106564A for ; Tue, 23 Aug 2011 20:23:17 +0000 (UTC) (envelope-from lev@FreeBSD.org) Received: from onlyone.friendlyhosting.spb.ru (onlyone.friendlyhosting.spb.ru [IPv6:2a01:4f8:131:60a2::2]) by mx1.freebsd.org (Postfix) with ESMTP id E6BD38FC15 for ; Tue, 23 Aug 2011 20:23:16 +0000 (UTC) Received: from lion.home.serebryakov.spb.ru (unknown [IPv6:2001:470:923f:1:b1b7:d4b2:b3b3:a68b]) (Authenticated sender: lev@serebryakov.spb.ru) by onlyone.friendlyhosting.spb.ru (Postfix) with ESMTPA id 97FEB4AC1C for ; Wed, 24 Aug 2011 00:23:15 +0400 (MSD) Date: Wed, 24 Aug 2011 00:23:12 +0400 From: Lev Serebryakov Organization: FreeBSD X-Priority: 3 (Normal) Message-ID: <12360197.20110824002312@serebryakov.spb.ru> To: freebsd-fs@FreeBSD.org MIME-Version: 1.0 Content-Type: text/plain; charset=windows-1251 Content-Transfer-Encoding: quoted-printable Cc: Subject: Is growfs(8) safe? X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: lev@FreeBSD.org List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 23 Aug 2011 20:23:17 -0000 Hello, Freebsd-fs. Is growfs(8) safe now? I remember, there were some problems with non-zeroed data, but it seems to be fixed. But it HAS one bug for sure: "int p_size," which overflows on big FSes. It is in sectors, and 32 bit signed, so max value is only slightly less than 1Tb (!) I've fixed it (a PR will follow a bit later), but I'm not sure: is it the only bug in such an important utility? Does somebody use it? Or has everybody migrated to ZFS already? -- // Black Lion AKA Lev Serebryakov From owner-freebsd-fs@FreeBSD.ORG Tue Aug 23 20:50:58 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 69A96106566C for ; Tue, 23 Aug 2011 20:50:58 +0000 (UTC) (envelope-from dpd@bitgravity.com) Received: from mail-iy0-f172.google.com (mail-iy0-f172.google.com [209.85.210.172]) by mx1.freebsd.org (Postfix) with ESMTP id 3BB838FC0C for ; Tue, 23 Aug 2011 20:50:57 +0000 (UTC) Received: by iye7 with SMTP id 7so1516545iye.17 for ; Tue, 23 Aug 2011 13:50:57 -0700 (PDT) Received: by 10.231.6.159 with SMTP id 31mr8464525ibz.17.1314132657489; Tue, 23 Aug 2011 13:50:57 -0700 (PDT) Received: from netops-195.sfo1.bitgravity.com (netops-195.sfo1.bitgravity.com [209.131.110.195]) by mx.google.com with ESMTPS id g21sm112928ibl.41.2011.08.23.13.50.56 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 23 Aug 2011 13:50:57 -0700 (PDT) From: David P Discher Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Date: Tue, 23 Aug 2011 13:50:56 -0700 Message-Id: To: freebsd-fs@freebsd.org Mime-Version: 1.0 (Apple Message framework v1084) X-Mailer: Apple Mail (2.1084) Subject: _sx_xlock_hard panic - is this zfs_zget panic deadlock ? (8.1-RELEASE) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 23 Aug 2011 20:50:58 -0000 Hey FreeBSD FS - I got a new one, well, new for us, possibly already fixed.
I got the following panic - Fatal trap 9: general protection fault while in kernel mode cpuid = 13; apic id = 21 instruction pointer = 0x20:0xffffffff80514848 stack pointer = 0x28:0xffffff9be057f230 frame pointer = 0x28:0xffffff9be057f2b0 code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 94019 (rsync) [thread pid 94019 tid 102556 ] Stopped at _sx_xlock_hard+0xd8: movl 0x290(%r12),%r8d More kgdb output below. It looks like the operation in kern_sx panic'ed: #9 0xffffffff80514848 in _sx_xlock_hard (sx=0xffffff008040e0d8, tid=18446742975725269984, opts=Variable "opts" is not available. ) at /usr/src/sys/kern/kern_sx.c:513 513 x = SX_OWNER(x); The calling chain looks like zfs_zget -> dmu_bonus_hold -> dnode_hold_impl -> _sx_xlock -> _sx_xlock_hard The best I can tell, is this looks like a race or deadlock issue, leaked lock, etc, something along those lines. My searching has uncovered a patch in -head/-stable that may address this particular issue: ------------------------------------------------------------------------ r209097 | mm | 2010-06-12 04:22:45 -0700 (Sat, 12 Jun 2010) | 8 lines Changed paths: M /head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c Fix ZFS panic deadlock: cycle in blocking chain via zfs_zget OpenSolaris onnv-revision: 9774:0bb234ab2287 Obtained from: OpenSolaris (Bug ID 6788152) Approved by: pjd, delphij (mentor) MFC after: 3 days ------------------------------------------------------------------------ I'd like a little bit of confirmation or validation or otherwise if this panic I'm looking at could in fact be fixed by r209097, or is this something entirely different ? Thanks ! --- David P. Discher dpd@bitgravity.com * AIM: bgDavidDPD BITGRAVITY * http://www.bitgravity.com (kgdb) bt #0 doadump () at pcpu.h:223 #1 0xffffffff801f0e5c in db_fncall (dummy1=Variable "dummy1" is not available. ) at /usr/src/sys/ddb/db_command.c:548 #2 0xffffffff801f1191 in db_command (last_cmdp=0xffffffff80b105e0, cmd_table=Variable "cmd_table" is not available. ) at /usr/src/sys/ddb/db_command.c:445 #3 0xffffffff801f13e0 in db_command_loop () at /usr/src/sys/ddb/db_command.c:498 #4 0xffffffff801f3339 in db_trap (type=Variable "type" is not available. ) at /usr/src/sys/ddb/db_main.c:229 #5 0xffffffff8053aff5 in kdb_trap (type=9, code=0, tf=0xffffff9be057f180) at /usr/src/sys/kern/subr_kdb.c:535 #6 0xffffffff8079519d in trap_fatal (frame=0xffffff9be057f180, eva=Variable "eva" is not available. ) at /usr/src/sys/amd64/amd64/trap.c:772 #7 0xffffffff80795a9a in trap (frame=0xffffff9be057f180) at /usr/src/sys/amd64/amd64/trap.c:588 #8 0xffffffff8077c827 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:223 #9 0xffffffff80514848 in _sx_xlock_hard (sx=0xffffff008040e0d8, tid=18446742975725269984, opts=Variable "opts" is not available. ) at /usr/src/sys/kern/kern_sx.c:513 #10 0xffffffff80514c89 in _sx_xlock (sx=0xffffff008040e0d8, opts=Variable "opts" is not available. ) at sx.h:148 #11 0xffffffff8104c82d in dnode_hold_impl (os=0xffffff002964f400, object=Variable "object" is not available. ) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c:607 #12 0xffffffff81042c7a in dmu_bonus_hold (os=Variable "os" is not available.
) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:147 #13 0xffffffff81081e77 in zfs_zget (zfsvfs=0xffffff00295de000, obj_num=10078547, zpp=0xffffff9be057f518) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c:869 #14 0xffffffff81092fb3 in zfs_dirent_lock (dlpp=0xffffff9be057f520, dzp=0xffffff13942938d0, name=0xffffff9be057f5f0 "n", zpp=0xffffff9be057f518, flag=Variable "flag" is not available. ) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_dir.c:321 #15 0xffffffff81093209 in zfs_dirlook (dzp=0xffffff13942938d0, name=0xffffff9be057f5f0 "n", vpp=0xffffff9be057f970, flags=Variable "flags" is not available. ) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_dir.c:413 #16 0xffffffff810a2850 in zfs_lookup (dvp=0xffffff08daee6760, nm=0xffffff9be057f5f0 "n", vpp=0xffffff9be057f970, cnp=0xffffff9be057f998, nameiop=0, cr=0xffffff028c6a5200, td=0xffffff005b0973e0, flags=Variable "flags" is not available. ) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:1171 #17 0xffffffff810a3791 in zfs_freebsd_lookup (ap=0xffffff9be057f750) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:4059 #18 0xffffffff807e08f4 in VOP_CACHEDLOOKUP_APV (vop=0xffffffff81107b20, a=0xffffff9be057f750) at vnode_if.c:187 #19 0xffffffff80580190 in vfs_cache_lookup (ap=Variable "ap" is not available. ) at vnode_if.h:80 #20 0xffffffff807e33ec in VOP_LOOKUP_APV (vop=0xffffffff81107b20, a=0xffffff9be057f810) at vnode_if.c:123 #21 0xffffffff80586864 in lookup (ndp=0xffffff9be057f940) at vnode_if.h:54 #22 0xffffffff80587797 in namei (ndp=0xffffff9be057f940) at /usr/src/sys/kern/vfs_lookup.c:269 #23 0xffffffff80594e02 in kern_statat_vnhook (td=0xffffff005b0973e0, flag=Variable "flag" is not available. ) at /usr/src/sys/kern/vfs_syscalls.c:2346 #24 0xffffffff80595025 in kern_statat (td=Variable "td" is not available. ) at /usr/src/sys/kern/vfs_syscalls.c:2327 #25 0xffffffff805950ea in lstat (td=Variable "td" is not available. ) at /usr/src/sys/kern/vfs_syscalls.c:2390 #26 0xffffffff8079564b in syscall (frame=0xffffff9be057fc80) at /usr/src/sys/amd64/amd64/trap.c:945 #27 0xffffffff8077cb01 in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:374 #28 0x00000008007810fc in ?? () Previous frame inner to this frame (corrupt stack?) (kgdb) up #1 0xffffffff801f0e5c in db_fncall (dummy1=Variable "dummy1" is not available. ) at /usr/src/sys/ddb/db_command.c:548 548 *rv = (*f)(args[0], args[1], args[2], args[3], args[4], args[5], (kgdb) up #2 0xffffffff801f1191 in db_command (last_cmdp=0xffffffff80b105e0, cmd_table=Variable "cmd_table" is not available. ) at /usr/src/sys/ddb/db_command.c:445 445 (*cmd->fcn)(addr, have_addr, count, modif); (kgdb) up #3 0xffffffff801f13e0 in db_command_loop () at /usr/src/sys/ddb/db_command.c:498 498 db_command(&db_last_command, &db_cmd_table, /* dopager */ 1); (kgdb) up #4 0xffffffff801f3339 in db_trap (type=Variable "type" is not available. ) at /usr/src/sys/ddb/db_main.c:229 229 db_command_loop(); (kgdb) up #5 0xffffffff8053aff5 in kdb_trap (type=9, code=0, tf=0xffffff9be057f180) at /usr/src/sys/kern/subr_kdb.c:535 535 handled = kdb_dbbe->dbbe_trap(type, code); (kgdb) up #6 0xffffffff8079519d in trap_fatal (frame=0xffffff9be057f180, eva=Variable "eva" is not available. ) at /usr/src/sys/amd64/amd64/trap.c:772 772 if (kdb_trap(type, 0, frame)) (kgdb) up #7 0xffffffff80795a9a in trap (frame=0xffffff9be057f180) at /usr/src/sys/amd64/amd64/trap.c:588 588 trap_fatal(frame, 0); (kgdb) up #8 0xffffffff8077c827 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:223 223 call trap Current language: auto; currently asm (kgdb) up #9 0xffffffff80514848 in _sx_xlock_hard (sx=0xffffff008040e0d8, tid=18446742975725269984, opts=Variable "opts" is not available. ) at /usr/src/sys/kern/kern_sx.c:513 513 x = SX_OWNER(x); Current language: auto; currently c (kgdb) p x $1 = 281466386776064 (kgdb) p sx $2 = (struct sx *) 0xffffff008040e0d8 (kgdb) p *sx $3 = {lock_object = {lo_name = 0x2901fc0044005e, lo_flags = 2104063083, lo_data = 1953056627, lo_witness = 0x154407f4763bc41}, sx_lock = 281466386776064} (kgdb) list 508 * running or the state of the lock changes. 509 */ 510 x = sx->sx_lock; 511 if ((sx->lock_object.lo_flags & SX_NOADAPTIVE) == 0) { 512 if ((x & SX_LOCK_SHARED) == 0) { 513 x = SX_OWNER(x); 514 owner = (struct thread *)x; 515 if (TD_IS_RUNNING(owner)) { 516 if (LOCK_LOG_TEST(&sx->lock_object, 0)) 517 CTR3(KTR_LOCK, (kgdb) up #10 0xffffffff80514c89 in _sx_xlock (sx=0xffffff008040e0d8, opts=Variable "opts" is not available. ) at sx.h:148 148 error = _sx_xlock_hard(sx, tid, opts, file, line); (kgdb) up #11 0xffffffff8104c82d in dnode_hold_impl (os=0xffffff002964f400, object=Variable "object" is not available. ) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c:607 607 mutex_enter(&dn->dn_mtx); (kgdb) list 602 dnode_destroy(dn); 603 dn = winner; 604 } 605 } 606 607 mutex_enter(&dn->dn_mtx); 608 type = dn->dn_type; 609 if (dn->dn_free_txg || 610 ((flag & DNODE_MUST_BE_ALLOCATED) && type == DMU_OT_NONE) || 611 ((flag & DNODE_MUST_BE_FREE) && type != DMU_OT_NONE)) { (kgdb) print dn $4 = (dnode_t *) 0xffffff008040e000 (kgdb) print *dn $5 = {dn_struct_rwlock = {lock_object = {lo_name = 0xffffffff81100b7c "dn->dn_struct_rwlock", lo_flags = 40960000, lo_data = 0, lo_witness = 0x0}, sx_lock = 1}, dn_link = {list_next = 0xffffff14dcffd630, list_prev = 0xffffff14e5f64c40}, dn_objset = 0xffffff002964f400, dn_object = 4611409328054323539, dn_dbuf = 0x10c837, dn_phys = 0x3f, dn_type = 538976288, dn_bonuslen = 8224, dn_bonustype = 75 'K', dn_nblkptr = 77 'M', dn_checksum = 51 '3', dn_compress = 48 '0', dn_nlevels = 49 '1', dn_indblkshift = 51 '3', dn_datablkshift = 72 'H', dn_datablkszsec = 18247, dn_datablksz = 1145130323, dn_maxblkid = 4698733595085963320, dn_next_nblkptr = "08iH", dn_next_nlevels = "athc", dn_next_indblkshift = " iUH", dn_next_bonuslen = {16695, 12851, 12339, 12353}, dn_next_blksz = {909397057, 538980384, 538976288, 538976288}, dn_dirty_link = {{list_next = 0x2f00400080102020, list_prev = 0x7020002004000}, {list_next = 0xfc10003f00103fff, list_prev = 0xfffffff010000fb}, {list_next = 0x78000300070000, list_prev = 0x7800780078}, {list_next = 0x0, list_prev = 0x6170e001f0000}}, dn_mtx = {lock_object = {lo_name = 0x2901fc0044005e, lo_flags = 2104063083, lo_data = 1953056627, lo_witness = 0x154407f4763bc41}, sx_lock = 281466386776064}, dn_dirty_records = {{list_size = 70088136784871424, list_offset = 6724054219973732112, list_head = {list_next = 0xca00000001, list_prev = 0xcca250005a874000}}, {list_size = 2807702982, list_offset = 0, list_head = {list_next = 0x401c409c0000, list_prev = 0x0}}, {list_size = 3096229038784512, list_offset = 1004619926008233984, list_head = {list_next = 0x200440000001fa20, list_prev = 0x7000000000230}}, {list_size = 2258422671148295, list_offset = 74596396452349191, list_head = {list_next = 0x4235413700060308, list_prev = 0x5db90000180a0000}}}, dn_ranges = {{avl_root = 0x8000ffff, avl_compar = 0, avl_offset = 8589934592, avl_numnodes = 0, avl_size = 0}, {avl_root = 0x0, avl_compar = 0, avl_offset = 0, avl_numnodes = 0, avl_size = 0}, {avl_root = 0x0, avl_compar = 0, avl_offset = 61, avl_numnodes = 0, avl_size = 2026619832316723200}, {avl_root = 0x0, avl_compar = 0x21101f, avl_offset = 0, avl_numnodes = 0, avl_size = 65011713}}, dn_allocated_txg = 0, dn_free_txg = 0, dn_assigned_txg = 0, dn_notxholds = {cv_description = 0x0, cv_waiters = 1503985664}, dn_dirtyctx = DN_UNDIRTIED, dn_dirtyctx_firstset = 0x0, dn_tx_holds = {rc_count = 0}, dn_holds = {rc_count = 1}, dn_dbufs_mtx = {lock_object = {lo_name = 0xffffffff81100b9e "dn->dn_dbufs_mtx", lo_flags = 40960000, lo_data = 0, lo_witness = 0x0}, sx_lock = 1}, dn_dbufs = {list_size = 224, list_offset = 176, list_head = {list_next = 0xffffff07072075f0, list_prev = 0xffffff07072075f0}}, dn_bonus = 0xffffff0abe353700, dn_zio = 0x0, dn_zfetch = {zf_rwlock = {lock_object = {lo_name = 0xffffffff81100ddf "zf->zf_rwlock", lo_flags = 40960000, lo_data = 0, lo_witness = 0x0}, sx_lock = 1}, zf_stream = {list_size = 112, list_offset = 88, list_head = {list_next = 0xffffff159f7c5dd8, list_prev = 0xffffff159f7c5dd8}}, zf_dnode = 0xffffff008040e000, zf_stream_cnt = 1, zf_alloc_fail = 1}} (kgdb) From owner-freebsd-fs@FreeBSD.ORG Wed Aug 24 09:24:23 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A70D51065679 for ; Wed, 24 Aug 2011 09:24:23 +0000 (UTC) (envelope-from freebsd-fs@m.gmane.org) Received: from lo.gmane.org (lo.gmane.org [80.91.229.12]) by mx1.freebsd.org (Postfix) with ESMTP id 656ED8FC14 for ; Wed, 24 Aug 2011 09:24:23 +0000 (UTC) Received: from list by lo.gmane.org with local (Exim 4.69) (envelope-from ) id 1Qw9h8-0002a0-1B for freebsd-fs@freebsd.org; Wed, 24 Aug 2011 11:24:22 +0200 Received: from lara.cc.fer.hr ([161.53.72.113]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Wed, 24 Aug 2011 11:24:22 +0200 Received: from ivoras by lara.cc.fer.hr with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Wed, 24 Aug 2011 11:24:22 +0200 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-fs@freebsd.org From: Ivan Voras Date: Wed, 24 Aug 2011 11:24:06 +0200 Lines: 18 Message-ID: References: <12360197.20110824002312@serebryakov.spb.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Complaints-To: usenet@dough.gmane.org X-Gmane-NNTP-Posting-Host: lara.cc.fer.hr User-Agent:
Mozilla/5.0 (X11; U; FreeBSD amd64; en-US; rv:1.9.2.12) Gecko/20101102 Thunderbird/3.1.6 In-Reply-To: <12360197.20110824002312@serebryakov.spb.ru> X-Enigmail-Version: 1.1.2 Subject: Re: Is growfs(8) safe? X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 24 Aug 2011 09:24:23 -0000 On 23/08/2011 22:23, Lev Serebryakov wrote: > Hello, Freebsd-fs. > > Is growfs(8) safe now? I remember, there were some problems with > non-zeroed data, but it seems to be fixed. > > But it HAS one bug for sure: "int p_size," which overflows on big > FSes. It is in sectors, and 32 bit signed, so max value is only > slightly less than 1Tb (!) I use it approximately 1-2 times a year to resize virtual machines (though every time < 1 TB). It has worked fine so far. > Does somebody use it? Or has everybody migrated to ZFS already? I'll reconsider moving to ZFS when there's a week without a critical bug being reported ;) From owner-freebsd-fs@FreeBSD.ORG Wed Aug 24 14:14:03 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E7817106566C for ; Wed, 24 Aug 2011 14:14:03 +0000 (UTC) (envelope-from feld@feld.me) Received: from mwi1.coffeenet.org (unknown [IPv6:2607:f4e0:100:300::2]) by mx1.freebsd.org (Postfix) with ESMTP id C199E8FC12 for ; Wed, 24 Aug 2011 14:14:03 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=feld.me; s=blargle; h=In-Reply-To:Message-Id:From:Mime-Version:Date:References:Subject:To:Content-Type; bh=SSTN5nr5psjxp80eaZ9Xa0wOLH0dYNJKZrOI/AE9FWQ=; b=RgbAFu0XQDqTjFgnvyV650sSg/Ra6n6BA1f377wcFtw1b9zT3+QUAs0Fsl8HjLJwj0lPYkb8+0+WQUIOTpq+WKlKqLtT3nX3oZJYqwtFCEgkxGmOyosLat6UHAVdKvvR; Received: from localhost ([127.0.0.1] helo=mwi1.coffeenet.org) by mwi1.coffeenet.org with esmtp (Exim 4.76 (FreeBSD)) (envelope-from ) id 1QwEGg-0003LY-IJ for freebsd-fs@freebsd.org; Wed, 24 Aug 2011 09:17:23 -0500 Received: from feld@feld.me by mwi1.coffeenet.org (Archiveopteryx 3.1.3) with esmtpsa id 1314195436-68202-68201/4/18; Wed, 24 Aug 2011 14:17:16 +0000 Content-Type: text/plain; charset=utf-8; format=flowed; delsp=yes To: freebsd-fs@freebsd.org References: <12360197.20110824002312@serebryakov.spb.ru> Date: Wed, 24 Aug 2011 09:13:53 -0500 Mime-Version: 1.0 From: Mark Felder Message-Id: In-Reply-To: <12360197.20110824002312@serebryakov.spb.ru> User-Agent: Opera Mail/12.00 (FreeBSD) X-SA-Score: -1.0 Subject: Re: Is growfs(8) safe? X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 24 Aug 2011 14:14:04 -0000 On Tue, 23 Aug 2011 15:23:12 -0500, Lev Serebryakov wrote: > Is growfs(8) safe now? I remember, there were some problems with > non-zeroed data, but it seems to be fixed. We use it regularly at work to grow customers' VMs. Generally they're no larger than like 120GB though. Regards, Mark
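The overflow Lev describes above is easy to sanity-check with shell arithmetic; the sketch below only illustrates the limit implied by a signed 32-bit sector count with 512-byte sectors, and makes no claim about where growfs actually breaks:

# (2^31 - 1) sectors at 512 bytes per sector:
echo $(( 2147483647 * 512 ))
# prints 1099511627264, i.e. 512 bytes short of exactly 1 TiB --
# hence "slightly less than 1Tb"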
Regards, Mark From owner-freebsd-fs@FreeBSD.ORG Wed Aug 24 14:50:13 2011 Return-Path: Delivered-To: freebsd-fs@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 99F9A106566C for ; Wed, 24 Aug 2011 14:50:13 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 7038A8FC17 for ; Wed, 24 Aug 2011 14:50:13 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.4/8.14.4) with ESMTP id p7OEoD4L090782 for ; Wed, 24 Aug 2011 14:50:13 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.4/8.14.4/Submit) id p7OEoDSI090781; Wed, 24 Aug 2011 14:50:13 GMT (envelope-from gnats) Date: Wed, 24 Aug 2011 14:50:13 GMT Message-Id: <201108241450.p7OEoDSI090781@freefall.freebsd.org> To: freebsd-fs@FreeBSD.org From: Martin Simmons Cc: Subject: Re: kern/153847: [nfs] [panic] Kernel panic from incorrect m_free in nfs_getattr X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: Martin Simmons List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 24 Aug 2011 14:50:13 -0000 The following reply was made to PR kern/153847; it has been noted by GNATS. From: Martin Simmons To: bug-followup@FreeBSD.org Cc: Subject: Re: kern/153847: [nfs] [panic] Kernel panic from incorrect m_free in nfs_getattr Date: Wed, 24 Aug 2011 15:37:31 +0100 FTR, I just got this again with the latest 7.4 kernel: FreeBSD 7.4-RELEASE #0: Thu Feb 17 03:51:56 UTC 2011 Unread portion of the kernel message buffer: <6>nfs server pid947@greig:/sp: is alive again Fatal trap 12: page fault while in kernel mode cpuid = 1; apic id = 01 fault virtual address = 0x819 fault code = supervisor read, page not present instruction pointer = 0x20:0xc086fea0 stack pointer = 0x28:0xc53c9824 frame pointer = 0x28:0xc53c9834 code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, def32 1, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 19529 (ls) trap number = 12 panic: page fault cpuid = 1 Uptime: 6d22h30m27s Physical memory: 2035 MB Dumping 257 MB: 242 226 210 194 178 162 146 130 114 98 82 66 50 34 18 2 Reading symbols from /boot/kernel/acpi.ko...Reading symbols from /boot/kernel/acpi.ko.symbols...done. done. Loaded symbols for /boot/kernel/acpi.ko #0 doadump () at pcpu.h:197 197 pcpu.h: No such file or directory. in pcpu.h (kgdb) where #0 doadump () at pcpu.h:197 #1 0xc081c693 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:421 #2 0xc081c967 in panic (fmt=Variable "fmt" is not available. 
) at /usr/src/sys/kern/kern_shutdown.c:576 #3 0xc0b29a0c in trap_fatal (frame=0xc53c97e4, eva=2073) at /usr/src/sys/i386/i386/trap.c:950 #4 0xc0b29c90 in trap_pfault (frame=0xc53c97e4, usermode=0, eva=2073) at /usr/src/sys/i386/i386/trap.c:863 #5 0xc0b2a66c in trap (frame=0xc53c97e4) at /usr/src/sys/i386/i386/trap.c:541 #6 0xc0b0cf8b in calltrap () at /usr/src/sys/i386/i386/exception.s:166 #7 0xc086fea0 in m_freem (mb=0x819) at /usr/src/sys/kern/uipc_mbuf.c:162 #8 0xc09c8c95 in nfs_getattr (ap=0xc53c9968) at /usr/src/sys/nfsclient/nfs_vnops.c:664 #9 0xc0b3f102 in VOP_GETATTR_APV (vop=0xc0cb6f80, a=0xc53c9968) at vnode_if.c:530 #10 0xc09cb445 in nfs_lookup (ap=0xc53c9a90) at vnode_if.h:286 #11 0xc0b40b06 in VOP_LOOKUP_APV (vop=0xc0cb6f80, a=0xc53c9a90) at vnode_if.c:99 #12 0xc089685b in lookup (ndp=0xc53c9ba8) at vnode_if.h:57 #13 0xc08976de in namei (ndp=0xc53c9ba8) at /usr/src/sys/kern/vfs_lookup.c:234 #14 0xc08a58e4 in kern_stat (td=0xc5c396c0, path=0x28212088
, pathseg=UIO_USERSPACE, sbp=0xc53c9c18) at /usr/src/sys/kern/vfs_syscalls.c:2141 #15 0xc08a5acf in stat (td=0xc5c396c0, uap=0xc53c9cfc) at /usr/src/sys/kern/vfs_syscalls.c:2125 #16 0xc0b29fe5 in syscall (frame=0xc53c9d38) at /usr/src/sys/i386/i386/trap.c:1101 #17 0xc0b0cff0 in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:262 #18 0x00000033 in ?? () From owner-freebsd-fs@FreeBSD.ORG Wed Aug 24 21:28:42 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 96008106564A; Wed, 24 Aug 2011 21:28:42 +0000 (UTC) (envelope-from jwd@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 6C6468FC0A; Wed, 24 Aug 2011 21:28:42 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.4/8.14.4) with ESMTP id p7OLSg6a059113; Wed, 24 Aug 2011 21:28:42 GMT (envelope-from jwd@freefall.freebsd.org) Received: (from jwd@localhost) by freefall.freebsd.org (8.14.4/8.14.4/Submit) id p7OLSg1k059112; Wed, 24 Aug 2011 21:28:42 GMT (envelope-from jwd) Date: Wed, 24 Aug 2011 21:28:42 +0000 From: John To: freebsd-current@freebsd.org, freebsd-fs@freebsd.org Message-ID: <20110824212842.GA50140@FreeBSD.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.3i Cc: Subject: F_RDLCK lock to FreeBSD NFS server fails to R/O target file X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 24 Aug 2011 21:28:42 -0000 Hi Fellow NFS'ers, I believe I have found the problem we've been having with read locks while attaching to a FreeBSD NFS server. In sys/nlm/nlm_prot_impl.c, function nlm_get_vfs_state(), there is a call to VOP_ACCESS() as follows:

/*
 * Check cred.
 */
NLM_DEBUG(3, "nlm_get_vfs_state(): Calling VOP_ACCESS(VWRITE) with cred->cr_uid=%d\n", cred->cr_uid);
error = VOP_ACCESS(vs->vs_vp, VWRITE, cred, curthread);
if (error) {
	NLM_DEBUG(3, "nlm_get_vfs_state(): caller_name = %s VOP_ACCESS() returns %d\n", host->nh_caller_name, error);
	goto out;
}

The file being accessed is read-only to the user, and open()ed with O_RDONLY. The lock being requested is for a read.

fd = open(filename, O_RDONLY, 0);
...
lblk.l_type = F_RDLCK;
lblk.l_start = 0;
lblk.l_whence = SEEK_SET;
lblk.l_len = 0;
lblk.l_pid = 0;
rc = fcntl(fd, F_SETLK, &lblk);

Running the above from a remote system, the lock call fails with errno set to ENOLCK. cred->cr_uid comes in as 227, which is my uid on the remote system. Since the file is R/O to me, and the VOP_ACCESS() is asking for VWRITE, it fails with errno 13, EACCES, Permission denied. The above operations work correctly against some of our other favorite big-name NFS vendors :-) Opinions on the "correct" way to fix this? 1. Since we're only asking for a read lock, why do we need to ask for VWRITE? I may not understand an underlying requirement for the VWRITE so please feel free to educate me if needed. Something like: request == F_RDLCK ? VREAD : VWRITE (need to figure out where to get the request from in this context). 2. Attempt VWRITE, fall back to VREAD... seems off to me though. 3. Other? I appreciate any thoughts on this.
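For reference, the fragment above fleshed out into a self-contained test program. This is a sketch of the reproducer only (the program name and usage string are invented for illustration), but it exercises exactly the F_RDLCK-on-O_RDONLY path that trips the VWRITE check:

#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	struct flock lblk;
	int fd;

	if (argc != 2)
		errx(1, "usage: rdlcktest <file on NFS mount>");
	/* The target file is read-only to the caller. */
	fd = open(argv[1], O_RDONLY);
	if (fd == -1)
		err(1, "open");
	memset(&lblk, 0, sizeof(lblk));
	lblk.l_type = F_RDLCK;		/* a read lock only */
	lblk.l_start = 0;
	lblk.l_whence = SEEK_SET;
	lblk.l_len = 0;			/* the whole file */
	if (fcntl(fd, F_SETLK, &lblk) == -1)
		err(1, "fcntl(F_SETLK)");	/* fails with ENOLCK as described */
	warnx("read lock granted");
	close(fd);
	return (0);
}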
Thanks, John While they might not follow style(9) completely, I've uploaded my patch to nlm_prot_impl.c with the NLM_DEBUG() calls I've added. I'd appreciate it if someone would consider committing them so whoever debugs this file next will have them available. http://people.freebsd.org/~jwd/nlm_prot_impl.c.patch From owner-freebsd-fs@FreeBSD.ORG Thu Aug 25 08:20:13 2011 Return-Path: Delivered-To: freebsd-fs@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C8D1C1065674 for ; Thu, 25 Aug 2011 08:20:13 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id AC5E08FC17 for ; Thu, 25 Aug 2011 08:20:13 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.4/8.14.4) with ESMTP id p7P8KDk0095953 for ; Thu, 25 Aug 2011 08:20:13 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.4/8.14.4/Submit) id p7P8KDlQ095952; Thu, 25 Aug 2011 08:20:13 GMT (envelope-from gnats) Date: Thu, 25 Aug 2011 08:20:13 GMT Message-Id: <201108250820.p7P8KDlQ095952@freefall.freebsd.org> To: freebsd-fs@FreeBSD.org From: dfilter@FreeBSD.ORG (dfilter service) Cc: Subject: Re: kern/160035: commit references a PR X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: dfilter service List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 25 Aug 2011 08:20:13 -0000 The following reply was made to PR kern/160035; it has been noted by GNATS. From: dfilter@FreeBSD.ORG (dfilter service) To: bug-followup@FreeBSD.org Cc: Subject: Re: kern/160035: commit references a PR Date: Thu, 25 Aug 2011 08:17:55 +0000 (UTC) Author: mm Date: Thu Aug 25 08:17:39 2011 New Revision: 225166 URL: http://svn.freebsd.org/changeset/base/225166 Log: Generalize ffs_pages_remove() into vn_pages_remove(). Remove mapped pages for all dataset vnodes in zfs_rezget() using new vn_pages_remove() to fix mmapped files changed by zfs rollback or zfs receive -F. PR: kern/160035, kern/156933 Reviewed by: kib, pjd Approved by: re (kib) MFC after: 1 week Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c head/sys/kern/vfs_vnops.c head/sys/sys/vnode.h head/sys/ufs/ffs/ffs_extern.h head/sys/ufs/ffs/ffs_inode.c head/sys/ufs/ffs/ffs_softdep.c Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c Thu Aug 25 07:28:07 2011 (r225165) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c Thu Aug 25 08:17:39 2011 (r225166) @@ -1273,6 +1273,7 @@ zfs_rezget(znode_t *zp) zfsvfs_t *zfsvfs = zp->z_zfsvfs; dmu_object_info_t doi; dmu_buf_t *db; + vnode_t *vp; uint64_t obj_num = zp->z_id; uint64_t mode, size; sa_bulk_attr_t bulk[8]; @@ -1348,8 +1349,9 @@ zfs_rezget(znode_t *zp) * that for example regular file was replaced with directory * which has the same object number.
*/ - if (ZTOV(zp) != NULL && - ZTOV(zp)->v_type != IFTOVT((mode_t)zp->z_mode)) { + vp = ZTOV(zp); + if (vp != NULL && + vp->v_type != IFTOVT((mode_t)zp->z_mode)) { zfs_znode_dmu_fini(zp); ZFS_OBJ_HOLD_EXIT(zfsvfs, obj_num); return (EIO); @@ -1357,8 +1359,11 @@ zfs_rezget(znode_t *zp) zp->z_unlinked = (zp->z_links == 0); zp->z_blksz = doi.doi_data_block_size; - if (zp->z_size != size && ZTOV(zp) != NULL) - vnode_pager_setsize(ZTOV(zp), zp->z_size); + if (vp != NULL) { + vn_pages_remove(vp, 0, 0); + if (zp->z_size != size) + vnode_pager_setsize(vp, zp->z_size); + } ZFS_OBJ_HOLD_EXIT(zfsvfs, obj_num); Modified: head/sys/kern/vfs_vnops.c ============================================================================== --- head/sys/kern/vfs_vnops.c Thu Aug 25 07:28:07 2011 (r225165) +++ head/sys/kern/vfs_vnops.c Thu Aug 25 08:17:39 2011 (r225166) @@ -64,6 +64,9 @@ __FBSDID("$FreeBSD$"); #include #include +#include +#include + static fo_rdwr_t vn_read; static fo_rdwr_t vn_write; static fo_truncate_t vn_truncate; @@ -1398,3 +1401,15 @@ vn_chown(struct file *fp, uid_t uid, gid VFS_UNLOCK_GIANT(vfslocked); return (error); } + +void +vn_pages_remove(struct vnode *vp, vm_pindex_t start, vm_pindex_t end) +{ + vm_object_t object; + + if ((object = vp->v_object) == NULL) + return; + VM_OBJECT_LOCK(object); + vm_object_page_remove(object, start, end, 0); + VM_OBJECT_UNLOCK(object); +} Modified: head/sys/sys/vnode.h ============================================================================== --- head/sys/sys/vnode.h Thu Aug 25 07:28:07 2011 (r225165) +++ head/sys/sys/vnode.h Thu Aug 25 08:17:39 2011 (r225166) @@ -640,6 +640,7 @@ int _vn_lock(struct vnode *vp, int flags int vn_open(struct nameidata *ndp, int *flagp, int cmode, struct file *fp); int vn_open_cred(struct nameidata *ndp, int *flagp, int cmode, u_int vn_open_flags, struct ucred *cred, struct file *fp); +void vn_pages_remove(struct vnode *vp, vm_pindex_t start, vm_pindex_t end); int vn_pollrecord(struct vnode *vp, struct thread *p, int events); int vn_rdwr(enum uio_rw rw, struct vnode *vp, void *base, int len, off_t offset, enum uio_seg segflg, int ioflg, Modified: head/sys/ufs/ffs/ffs_extern.h ============================================================================== --- head/sys/ufs/ffs/ffs_extern.h Thu Aug 25 07:28:07 2011 (r225165) +++ head/sys/ufs/ffs/ffs_extern.h Thu Aug 25 08:17:39 2011 (r225166) @@ -79,7 +79,6 @@ int ffs_isfreeblock(struct fs *, u_char void ffs_load_inode(struct buf *, struct inode *, struct fs *, ino_t); int ffs_mountroot(void); void ffs_oldfscompat_write(struct fs *, struct ufsmount *); -void ffs_pages_remove(struct vnode *vp, vm_pindex_t start, vm_pindex_t end); int ffs_reallocblks(struct vop_reallocblks_args *); int ffs_realloccg(struct inode *, ufs2_daddr_t, ufs2_daddr_t, ufs2_daddr_t, int, int, int, struct ucred *, struct buf **); Modified: head/sys/ufs/ffs/ffs_inode.c ============================================================================== --- head/sys/ufs/ffs/ffs_inode.c Thu Aug 25 07:28:07 2011 (r225165) +++ head/sys/ufs/ffs/ffs_inode.c Thu Aug 25 08:17:39 2011 (r225166) @@ -120,18 +120,6 @@ ffs_update(vp, waitfor) } } -void -ffs_pages_remove(struct vnode *vp, vm_pindex_t start, vm_pindex_t end) -{ - vm_object_t object; - - if ((object = vp->v_object) == NULL) - return; - VM_OBJECT_LOCK(object); - vm_object_page_remove(object, start, end, 0); - VM_OBJECT_UNLOCK(object); -} - #define SINGLE 0 /* index of single indirect block */ #define DOUBLE 1 /* index of double indirect block */ #define TRIPLE 2 /* 
index of triple indirect block */ @@ -219,7 +207,7 @@ ffs_truncate(vp, length, flags, cred, td (void) chkdq(ip, -extblocks, NOCRED, 0); #endif vinvalbuf(vp, V_ALT, 0, 0); - ffs_pages_remove(vp, + vn_pages_remove(vp, OFF_TO_IDX(lblktosize(fs, -extblocks)), 0); osize = ip->i_din2->di_extsize; ip->i_din2->di_blocks -= extblocks; Modified: head/sys/ufs/ffs/ffs_softdep.c ============================================================================== --- head/sys/ufs/ffs/ffs_softdep.c Thu Aug 25 07:28:07 2011 (r225165) +++ head/sys/ufs/ffs/ffs_softdep.c Thu Aug 25 08:17:39 2011 (r225166) @@ -6541,7 +6541,7 @@ trunc_pages(ip, length, extblocks, flags fs = ip->i_fs; extend = OFF_TO_IDX(lblktosize(fs, -extblocks)); if ((flags & IO_EXT) != 0) - ffs_pages_remove(vp, extend, 0); + vn_pages_remove(vp, extend, 0); if ((flags & IO_NORMAL) == 0) return; BO_LOCK(&vp->v_bufobj); @@ -6567,7 +6567,7 @@ trunc_pages(ip, length, extblocks, flags end = OFF_TO_IDX(lblktosize(fs, lbn)); } else end = extend; - ffs_pages_remove(vp, OFF_TO_IDX(OFF_MAX), end); + vn_pages_remove(vp, OFF_TO_IDX(OFF_MAX), end); } /* _______________________________________________ svn-src-all@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-all To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org" From owner-freebsd-fs@FreeBSD.ORG Thu Aug 25 08:20:17 2011 Return-Path: Delivered-To: freebsd-fs@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7E9DC106564A for ; Thu, 25 Aug 2011 08:20:17 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 6E4B28FC16 for ; Thu, 25 Aug 2011 08:20:17 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.4/8.14.4) with ESMTP id p7P8KHv5095999 for ; Thu, 25 Aug 2011 08:20:17 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.4/8.14.4/Submit) id p7P8KHqG095992; Thu, 25 Aug 2011 08:20:17 GMT (envelope-from gnats) Date: Thu, 25 Aug 2011 08:20:17 GMT Message-Id: <201108250820.p7P8KHqG095992@freefall.freebsd.org> To: freebsd-fs@FreeBSD.org From: dfilter@FreeBSD.ORG (dfilter service) Cc: Subject: Re: kern/156933: commit references a PR X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: dfilter service List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 25 Aug 2011 08:20:17 -0000 The following reply was made to PR kern/156933; it has been noted by GNATS. From: dfilter@FreeBSD.ORG (dfilter service) To: bug-followup@FreeBSD.org Cc: Subject: Re: kern/156933: commit references a PR Date: Thu, 25 Aug 2011 08:17:55 +0000 (UTC) Author: mm Date: Thu Aug 25 08:17:39 2011 New Revision: 225166 URL: http://svn.freebsd.org/changeset/base/225166 Log: Generalize ffs_pages_remove() into vn_pages_remove(). Remove mapped pages for all dataset vnodes in zfs_rezget() using new vn_pages_remove() to fix mmapped files changed by zfs rollback or zfs receive -F. 
PR: kern/160035, kern/156933 Reviewed by: kib, pjd Approved by: re (kib) MFC after: 1 week Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c head/sys/kern/vfs_vnops.c head/sys/sys/vnode.h head/sys/ufs/ffs/ffs_extern.h head/sys/ufs/ffs/ffs_inode.c head/sys/ufs/ffs/ffs_softdep.c _______________________________________________ svn-src-all@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-all To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org" From owner-freebsd-fs@FreeBSD.ORG Thu Aug 25 17:47:47 2011 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 40603106564A; Thu, 25 Aug 2011 17:47:47 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 1A8B48FC13; Thu, 25 Aug 2011 17:47:47 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id AA49146B0C; Thu, 25 Aug 2011 13:47:46 -0400 (EDT) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id E56638A02F; Thu, 25 Aug 2011 13:47:45 -0400 (EDT) From: John Baldwin To: Rick Macklem Date: Thu, 25 Aug 2011 13:47:45 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110617; KDE/4.5.5; amd64; ; ) MIME-Version: 1.0 Content-Type: Text/Plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Message-Id: <201108251347.45460.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.6 (bigwig.baldwin.cx); Thu, 25 Aug 2011
13:47:46 -0400 (EDT) Cc: fs@freebsd.org Subject: Fixes to allow write clustering of NFS writes from a FreeBSD NFS client X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 25 Aug 2011 17:47:47 -0000 I was doing some analysis of compiles over NFS at work recently and noticed from 'iostat 1' on the NFS server that all my NFS writes were always 16k writes (meaning that writes were never being clustered). I added some debugging sysctls to the NFS client and server code as well as the FFS write VOP to figure out the various kinds of write requests that were being sent. I found that during the NFS compile, the NFS client was sending a lot of FILESYNC writes even though nothing in the compile process uses fsync(). Based on the debugging I added, I found that all of the FILESYNC writes were marked as such because the buffer in question did not have B_ASYNC set:

if ((bp->b_flags & (B_ASYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER)) == B_ASYNC)
	iomode = NFSV3WRITE_UNSTABLE;
else
	iomode = NFSV3WRITE_FILESYNC;

I eventually tracked this down to the code in the NFS client that pushes out a previous dirty region via 'bwrite()' when a write would dirty a non-contiguous region in the buffer:

if (bp->b_dirtyend > 0 &&
    (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) {
	if (bwrite(bp) == EINTR) {
		error = EINTR;
		break;
	}
	goto again;
}

(These writes are triggered during the compile of a file by the assembler seeking back into the file it has already written out to apply various fixups.) From this I concluded that the test above is flawed. We should be using UNSTABLE writes for the writes above as the user has not requested them to be synchronous. The issue (I believe) is that the NFS client is overloading the B_ASYNC flag. The B_ASYNC flag means that the caller of bwrite() (or rather bawrite()) is not synchronously blocking to see if the request has completed. Instead, it is a "fire and forget". This is not the same thing as the IO_SYNC flag passed in ioflags during a write request which requests fsync()-like behavior. To disambiguate the two I added a new B_SYNC flag and changed the NFS clients to set this for write requests with IO_SYNC set. I then updated the condition above to instead check for B_SYNC being set rather than checking for B_ASYNC being clear. That converted all the FILESYNC write RPCs from my builds into UNSTABLE write RPCs. The patch for that is at http://www.FreeBSD.org/~jhb/patches/nfsclient_sync_writes.patch. However, even with this change I was still not getting clustered writes on the NFS server (all writes were still 16k). After digging around in the code for a bit I found that ffs will only cluster writes if the passed-in 'ioflags' to ffs_write() specify a sequential hint. I then noticed that the NFS server has code to keep track of sequential I/O heuristics for reads, but not writes. I took the code from the NFS server's read op and moved it into a function to compute a sequential I/O heuristic that could be shared by both reads and writes. I also updated the sequential heuristic code to advance the counter based on the number of 16k blocks in each write instead of just doing ++ to match what we do for local file writes in sequential_heuristic() in vfs_vnops.c. Using this did give me some measure of NFS write clustering (though I can't peg my disks at MAXPHYS the way a dd to a file on a local filesystem can).
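To see why the old condition misfires on that bwrite() push, here is the decision logic as a small userland table, one function per version of the test. The flag values and the B_SYNC bit are stand-ins for illustration, not the kernel's buf.h definitions; the authoritative change is in the nfsclient_sync_writes.patch linked above:

#include <stdio.h>

#define	B_NEEDCOMMIT	0x01	/* illustrative values only */
#define	B_ASYNC		0x02
#define	B_NOCACHE	0x04
#define	B_CLUSTER	0x08
#define	B_SYNC		0x10	/* the hypothetical new flag */

static const char *
old_iomode(int bflags)
{
	/* Old test: UNSTABLE only for strictly-async buffers. */
	if ((bflags & (B_ASYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER)) ==
	    B_ASYNC)
		return ("UNSTABLE");
	return ("FILESYNC");
}

static const char *
new_iomode(int bflags)
{
	/* New test: FILESYNC only when the writer really asked for sync. */
	if ((bflags & (B_SYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER)) == 0)
		return ("UNSTABLE");
	return ("FILESYNC");
}

int
main(void)
{
	/* The dirty-region push: plain bwrite(), neither async nor sync. */
	printf("bwrite push:   old %s, new %s\n", old_iomode(0), new_iomode(0));
	/* An ordinary fire-and-forget bawrite(). */
	printf("bawrite:       old %s, new %s\n",
	    old_iomode(B_ASYNC), new_iomode(B_ASYNC));
	/* A write the caller flagged IO_SYNC (client now sets B_SYNC). */
	printf("IO_SYNC write: old %s, new %s\n",
	    old_iomode(B_SYNC), new_iomode(B_SYNC));
	return (0);
}

Running this prints FILESYNC for the bwrite push under the old test and UNSTABLE under the new one, which is the whole point of the change; the other two cases are unaffected.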
The patch for these changes is at http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch (This also fixes a bug in the new NFS server in that it wasn't actually clustering reads since it never updated nh->nh_nextr.) Combining the two changes together gave me about a 1% reduction in wall time for my builds:

+------------------------------------------------------------------------------+
|+ + ++ + +x++*x xx+x x x|
| |___________A__|_M_______|_A____________| |
+------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 10 1869.62 1943.11 1881.89 1886.12 21.549724
+ 10 1809.71 1886.53 1869.26 1860.706 21.530664
Difference at 95.0% confidence
-25.414 +/- 20.2391
-1.34742% +/- 1.07305%
(Student's t, pooled s = 21.5402)

One caveat: I tested both of these patches on the old NFS client and server on 8.2-stable. I then ported the changes to the new client and server and while I made sure they compiled, I have not tested the new client and server. -- John Baldwin From owner-freebsd-fs@FreeBSD.ORG Thu Aug 25 20:31:52 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3026A106566B; Thu, 25 Aug 2011 20:31:52 +0000 (UTC) (envelope-from jwd@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 04C678FC12; Thu, 25 Aug 2011 20:31:52 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.4/8.14.4) with ESMTP id p7PKVpgs080537; Thu, 25 Aug 2011 20:31:51 GMT (envelope-from jwd@freefall.freebsd.org) Received: (from jwd@localhost) by freefall.freebsd.org (8.14.4/8.14.4/Submit) id p7PKVp3T080536; Thu, 25 Aug 2011 20:31:51 GMT (envelope-from jwd) Date: Thu, 25 Aug 2011 20:31:51 +0000 From: John To: freebsd-current@freebsd.org, freebsd-fs@freebsd.org Message-ID: <20110825203151.GA61776@FreeBSD.org> References: <20110824212842.GA50140@FreeBSD.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20110824212842.GA50140@FreeBSD.org> User-Agent: Mutt/1.4.2.3i Cc: Subject: Re: F_RDLCK lock to FreeBSD NFS server fails to R/O target file [PATCH] X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 25 Aug 2011 20:31:52 -0000 After pondering the best way to allow the VOP_ACCESS() call to only query for the permissions really needed, I've come up with a patch that minimally adds one parameter to the nlm_get_vfs_state() function call with the lock type from the original argp. http://people.freebsd.org/~jwd/nlm_prot_impl.c.accmode.patch I'd appreciate a review and seeing what might be required to commit this prior to 9 release. Thanks, John ----- John's Original Message ----- > Hi Fellow NFS'ers, > > I believe I have found the problem we've been having with read locks > while attaching to a FreeBSD NFS server. > > In sys/nlm/nlm_prot_impl.c, function nlm_get_vfs_state(), there is a call > to VOP_ACCESS() as follows: > > /* > * Check cred.
> */ > NLM_DEBUG(3, "nlm_get_vfs_state(): Calling VOP_ACCESS(VWRITE) with cred->cr_uid=%d\n",cred->cr_uid); > error = VOP_ACCESS(vs->vs_vp, VWRITE, cred, curthread); > if (error) { > NLM_DEBUG(3, "nlm_get_vfs_state(): caller_name = %s VOP_ACCESS() returns %d\n", > host->nh_caller_name, error); > goto out; > } > > The file being accessed is read only to the user, and open()ed with > O_RDONLY. The lock being requested is for a read. > > fd = open(filename, O_RDONLY, 0); > ... > > lblk.l_type = F_RDLCK; > lblk.l_start = 0; > lblk.l_whence= SEEK_SET; > lblk.l_len = 0; > lblk.l_pid = 0; > rc = fcntl(fd, F_SETLK, &lblk); > > Running the above from a remote system, the lock call fails with > errno set to ENOLCK. Given cred->cr_uid comes in as 227 which is > my uid on the remote system. Since the file is R/O to me, and the > VOP_ACCESS() is asking for VWRITE, it fails with errno 13, EACCES, > Permission denied. > > The above operations work correctly to some of our other > favorite big-name nfs vendors :-) > > Opinions on the "correct" way to fix this? > > 1. Since we're only asking for a read lock, why do we need to ask > for VWRITE? I may not understand an underlying requirement for > the VWRITE so please feel free to educate me if needed. > > Something like: request == F_RDLCK ? VREAD : VWRITE > (need to figure out where to get the request from in this context). > > 2. Attempt VWRITE, fallback to VREAD... seems off to me though. > > 3. Other? > > I appreciate any thoughts on this. > > Thanks, > John > > While they might not follow style(9) completely, I've uploaded > my patch to nlm_prot.impl.c with the NLM_DEBUG() calls i've added. > I'd appreciate it if someone would consider committing them so > who ever debugs this file next will have them available. > > http://people.freebsd.org/~jwd/nlm_prot_impl.c.patch > From owner-freebsd-fs@FreeBSD.ORG Thu Aug 25 20:45:50 2011 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 480B5106564A; Thu, 25 Aug 2011 20:45:50 +0000 (UTC) (envelope-from jwd@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 34CD48FC15; Thu, 25 Aug 2011 20:45:50 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.4/8.14.4) with ESMTP id p7PKjoWs089759; Thu, 25 Aug 2011 20:45:50 GMT (envelope-from jwd@freefall.freebsd.org) Received: (from jwd@localhost) by freefall.freebsd.org (8.14.4/8.14.4/Submit) id p7PKjouO089758; Thu, 25 Aug 2011 20:45:50 GMT (envelope-from jwd) Date: Thu, 25 Aug 2011 20:45:49 +0000 From: John To: John Baldwin Message-ID: <20110825204549.GB61776@FreeBSD.org> References: <201108251347.45460.jhb@freebsd.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201108251347.45460.jhb@freebsd.org> User-Agent: Mutt/1.4.2.3i Cc: Rick Macklem , fs@freebsd.org Subject: Re: Fixes to allow write clustering of NFS writes from a FreeBSD NFS client X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 25 Aug 2011 20:45:50 -0000 Hi John, This is an interesting fix. If I can I'll try patching a few systems and giving it a try. 
I don't know if this would help for timing comparisons, but years ago we used to run build work directly against our NFS storage. In general, we moved away from that to a two-stage approach:

cc foo.c -o /tmp/foo.o    # where /tmp is a memory filesystem
cp /tmp/foo.o /nfs/mounted/target/area/foo.o

This provided a very large performance boost. It's worth noting that different compilers require different levels of arm-wrestling to convince them to use the file specified with -o correctly (and directly). With a simple .mk file change you could probably get an up-to-date comparison of the current system vs your patch vs sequential i/o only. I'll let you know what I find and if we see any regressions. Thanks, John ----- John Baldwin's Original Message ----- > I was doing some analysis of compiles over NFS at work recently and noticed > from 'iostat 1' on the NFS server that all my NFS writes were always 16k > writes (meaning that writes were never being clustered). I added some > debugging sysctls to the NFS client and server code as well as the FFS write > VOP to figure out the various kind of write requests that were being sent. I > found that during the NFS compile, the NFS client was sending a lot of > FILESYNC writes even though nothing in the compile process uses fsync(). > Based on the debugging I added, I found that all of the FILESYNC writes were > marked as such because the buffer in question did not have B_ASYNC set: > > > if ((bp->b_flags & (B_ASYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER)) == B_ASYNC) > iomode = NFSV3WRITE_UNSTABLE; > else > iomode = NFSV3WRITE_FILESYNC; > > I eventually tracked this down to the code in the NFS client that pushes out a > previous dirty region via 'bwrite()' when a write would dirty a non-contiguous > region in the buffer: > > if (bp->b_dirtyend > 0 && > (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) { > if (bwrite(bp) == EINTR) { > error = EINTR; > break; > } > goto again; > } > > (These writes are triggered during the compile of a file by the assembler > seeking back into the file it has already written out to apply various > fixups.) > > From this I concluded that the test above is flawed. We should be using > UNSTABLE writes for the writes above as the user has not requested them to > be synchronous. The issue (I believe) is that the NFS client is overloading > the B_ASYNC flag. The B_ASYNC flag means that the caller of bwrite() > (or rather bawrite()) is not synchronously blocking to see if the request > has completed. Instead, it is a "fire and forget". This is not the same > thing as the IO_SYNC flag passed in ioflags during a write request which > requests fsync()-like behavior. To disambiguate the two I added a new > B_SYNC flag and changed the NFS clients to set this for write requests > with IO_SYNC set. I then updated the condition above to instead check for > B_SYNC being set rather than checking for B_ASYNC being clear. > > That converted all the FILESYNC write RPCs from my builds into UNSTABLE > write RPCs. The patch for that is at > http://www.FreeBSD.org/~jhb/patches/nfsclient_sync_writes.patch. > > However, even with this change I was still not getting clustered writes on > the NFS server (all writes were still 16k). After digging around in the > code for a bit I found that ffs will only cluster writes if the passed in > 'ioflags' to ffs_write() specify a sequential hint. I then noticed that > the NFS server has code to keep track of sequential I/O heuristics for > reads, but not writes.
I took the code from the NFS server's read op > and moved it into a function to compute a sequential I/O heuristic that > could be shared by both reads and writes. I also updated the sequential > heuristic code to advance the counter based on the number of 16k blocks > in each write instead of just doing ++ to match what we do for local > file writes in sequential_heuristic() in vfs_vnops.c. Using this did > give me some measure of NFS write clustering (though I can't peg my > disks at MAXPHYS the way a dd to a file on a local filesystem can). The > patch for these changes is at > http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch > > (This also fixes a bug in the new NFS server in that it wasn't actually > clustering reads since it never updated nh->nh_nextr.) > > Combining the two changes together gave me about a 1% reduction in wall > time for my builds: > > +------------------------------------------------------------------------------+ > |+ + ++ + +x++*x xx+x x x| > | |___________A__|_M_______|_A____________| | > +------------------------------------------------------------------------------+ > N Min Max Median Avg Stddev > x 10 1869.62 1943.11 1881.89 1886.12 21.549724 > + 10 1809.71 1886.53 1869.26 1860.706 21.530664 > Difference at 95.0% confidence > -25.414 +/- 20.2391 > -1.34742% +/- 1.07305% > (Student's t, pooled s = 21.5402) > > One caveat: I tested both of these patches on the old NFS client and server > on 8.2-stable. I then ported the changes to the new client and server and > while I made sure they compiled, I have not tested the new client and server. > > -- > John Baldwin From owner-freebsd-fs@FreeBSD.ORG Thu Aug 25 21:06:44 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8EE081065670 for ; Thu, 25 Aug 2011 21:06:44 +0000 (UTC) (envelope-from freebsd-fs@m.gmane.org) Received: from lo.gmane.org (lo.gmane.org [80.91.229.12]) by mx1.freebsd.org (Postfix) with ESMTP id 8079D8FC12 for ; Thu, 25 Aug 2011 21:06:35 +0000 (UTC) Received: from list by lo.gmane.org with local (Exim 4.69) (envelope-from ) id 1Qwh8B-0006Br-ES for freebsd-fs@freebsd.org; Thu, 25 Aug 2011 23:06:31 +0200 Received: from 208.88.188.90.adsl.tomsknet.ru ([90.188.88.208]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Thu, 25 Aug 2011 23:06:31 +0200 Received: from vadim_nuclight by 208.88.188.90.adsl.tomsknet.ru with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Thu, 25 Aug 2011 23:06:31 +0200 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-fs@freebsd.org From: Vadim Goncharov Date: Thu, 25 Aug 2011 21:06:18 +0000 (UTC) Organization: Nuclear Lightning @ Tomsk, TPU AVTF Hostel Lines: 28 Message-ID: References: <1092971110.92110.1313782831745.JavaMail.root@erie.cs.uoguelph.ca> <20110820145112.Y872@besplex.bde.org> <20110824015751.I2167@besplex.bde.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Complaints-To: usenet@dough.gmane.org X-Gmane-NNTP-Posting-Host: 208.88.188.90.adsl.tomsknet.ru X-Comment-To: Bruce Evans User-Agent: slrn/0.9.9p1 (FreeBSD) Subject: Re: touch(1) not working on directories in an msdosfs(5) envirement X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: vadim_nuclight@mail.ru List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 25 Aug 2011 21:06:44 -0000 Hi 
Bruce Evans! On Wed, 24 Aug 2011 02:19:56 +1000 (EST); Bruce Evans wrote about 'Re: touch(1) not working on directories in an msdosfs(5) envirement': >>> The above is only the least serious of the bugs in msdosfs_setattr() :-(. >>> With the above fix, plain touch works as well as possible -- it cannot >>> work perfectly since setting of atimes is not always supported. But >>> touch -r and more importantly, cp -p only work as well as possible for >>> root, since they use utimes() without the null timeptr arg that allows >>> plain touch to work. A non-null timeptr arg ends up normally requiring >>> root permissions for msdosfs where it normally doesn't require extra >>> permissions for ffs, because ownership requirements for the non-null case >>> cannot be satisfied by file systems that don't really support ownerships. >>> We fudge the ownerships and use weak checks on them in most places, but >>> for utimes() we use strict checks that almost always fail: from my old >>> version: >> >> So, now the usual case of not touching directory times on change is preserved, >> but cp -r et al. sets times as expected? Sounds good, could it be committed >> please? > Yes, cp -p works but only if the user is root or the owner of the file. > Someone else will have to commit it. Umm... but why not you?.. -- WBR, Vadim Goncharov. ICQ#166852181 mailto:vadim_nuclight@mail.ru [Anti-Greenpeace][Sober FreeBSD zealot][http://nuclight.livejournal.com] From owner-freebsd-fs@FreeBSD.ORG Thu Aug 25 21:09:31 2011 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BDA941065670; Thu, 25 Aug 2011 21:09:31 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 821DF8FC18; Thu, 25 Aug 2011 21:09:31 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id 19EC946B06; Thu, 25 Aug 2011 17:09:31 -0400 (EDT) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 9566C8A02E; Thu, 25 Aug 2011 17:09:30 -0400 (EDT) From: John Baldwin To: Bruce Evans Date: Thu, 25 Aug 2011 17:09:29 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110617; KDE/4.5.5; amd64; ; ) References: <201108251347.45460.jhb@freebsd.org> <20110826043611.D2962@besplex.bde.org> In-Reply-To: <20110826043611.D2962@besplex.bde.org> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201108251709.30072.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.6 (bigwig.baldwin.cx); Thu, 25 Aug 2011 17:09:30 -0400 (EDT) Cc: Rick Macklem , fs@freebsd.org Subject: Re: Fixes to allow write clustering of NFS writes from a FreeBSD NFS client X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 25 Aug 2011 21:09:31 -0000 On Thursday, August 25, 2011 3:24:15 pm Bruce Evans wrote: > On Thu, 25 Aug 2011, John Baldwin wrote: > > > I was doing some analysis of compiles over NFS at work recently and noticed > > from 'iostat 1' on the NFS server that all my NFS writes were always 16k > > writes (meaning that writes were never being clustered). 
I added some > > Did you see the old patches for this by Bjorn Gronwall? They went through > many iterations. He was mainly interested in the !async case and I was > mainly interested in the async case... Ah, no I had not seen these, thanks. > > and moved it into a function to compute a sequential I/O heuristic that > > could be shared by both reads and writes. I also updated the sequential > > heuristic code to advance the counter based on the number of 16k blocks > > in each write instead of just doing ++ to match what we do for local > > file writes in sequential_heuristic() in vfs_vnops.c. Using this did > > give me some measure of NFS write clustering (though I can't peg my > > disks at MAXPHYS the way a dd to a file on a local filesystem can). The > > I got close to it. The failure modes were mostly burstiness of i/o, where > the server buffer cache seemed to fill up so the client would stop sending > and stay stopped for too long (several seconds; enough to reduce the > throughput by 40-60%). Hmm, I can get writes up to around 40-50k, but not 128k. My test is to just dd from /dev/zero to a file on the NFS client using a blocksize of 64k or so. > > patch for these changes is at > > http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch > > > > (This also fixes a bug in the new NFS server in that it wasn't actually > > clustering reads since it never updated nh->nh_nextr.) > > > > Combining the two changes together gave me about a 1% reduction in wall > > time for my builds: > > > > +------------------------------------------------------------------------------+ > > |+ + ++ + +x++*x xx+x x x| > > | |___________A__|_M_______|_A____________| | > > +------------------------------------------------------------------------------+ > > N Min Max Median Avg Stddev > > x 10 1869.62 1943.11 1881.89 1886.12 21.549724 > > + 10 1809.71 1886.53 1869.26 1860.706 21.530664 > > Difference at 95.0% confidence > > -25.414 +/- 20.2391 > > -1.34742% +/- 1.07305% > > (Student's t, pooled s = 21.5402) > > > > One caveat: I tested both of these patches on the old NFS client and server > > on 8.2-stable. I then ported the changes to the new client and server and > > while I made sure they compiled, I have not tested the new client and server. > > Here is the version of Bjorn's patches that I last used (in 8-current in > 2008): > > % Index: nfs_serv.c > % =================================================================== > % RCS file: /home/ncvs/src/sys/nfsserver/nfs_serv.c,v > % retrieving revision 1.182 > % diff -u -2 -r1.182 nfs_serv.c > % --- nfs_serv.c 28 May 2008 16:23:17 -0000 1.182 > % +++ nfs_serv.c 1 Jun 2008 05:52:45 -0000 > % @@ -107,14 +107,18 @@ > % #define MAX_COMMIT_COUNT (1024 * 1024) > % > % -#define NUM_HEURISTIC 1017 > % +#define NUM_HEURISTIC 1031 /* Must be prime! */ > % +#define MAX_REORDERED_RPC 4 > % +#define HASH_MAXSTEP 0x3ff > % #define NHUSE_INIT 64 > % #define NHUSE_INC 16 > % #define NHUSE_MAX 2048 > % +#define NH_TAG(vp) ((uint32_t)((uintptr_t)vp / sizeof(struct vnode))) > % +CTASSERT(NUM_HEURISTIC > (HASH_MAXSTEP + 1)); Hmm, aside from 1017 not being prime (3 * 339), I'm not sure what the reasons for the rest of these changes are. 
> % > % static struct nfsheur { > % - struct vnode *nh_vp; /* vp to match (unreferenced pointer) */ > % - off_t nh_nextr; /* next offset for sequential detection */ > % - int nh_use; /* use count for selection */ > % - int nh_seqcount; /* heuristic */ > % + off_t nh_nextoff; /* next offset for sequential detection */ > % + uint32_t nh_tag; /* vp tag to match */ > % + uint16_t nh_use; /* use count for selection */ > % + uint16_t nh_seqcount; /* in units of logical blocks */ > % } nfsheur[NUM_HEURISTIC]; Hmm, not sure why this only stores the tag and uses uint16_t values here. The size difference is a few KB at best, and I'd rather store the full vnode to avoid oddities from hash collisions. > % @@ -131,8 +135,14 @@ > % static int nfs_commit_blks; > % static int nfs_commit_miss; > % +static int nfsrv_cluster_writes = 1; > % +static int nfsrv_cluster_reads = 1; > % +static int nfsrv_reordered_io; > % SYSCTL_INT(_vfs_nfsrv, OID_AUTO, async, CTLFLAG_RW, &nfs_async, 0, ""); > % SYSCTL_INT(_vfs_nfsrv, OID_AUTO, commit_blks, CTLFLAG_RW, &nfs_commit_blks, 0, ""); > % SYSCTL_INT(_vfs_nfsrv, OID_AUTO, commit_miss, CTLFLAG_RW, &nfs_commit_miss, 0, ""); > % > % +SYSCTL_INT(_vfs_nfsrv, OID_AUTO, cluster_writes, CTLFLAG_RW, &nfsrv_cluster_writes, 0, ""); > % +SYSCTL_INT(_vfs_nfsrv, OID_AUTO, cluster_reads, CTLFLAG_RW, &nfsrv_cluster_reads, 0, ""); > % +SYSCTL_INT(_vfs_nfsrv, OID_AUTO, reordered_io, CTLFLAG_RW, &nfsrv_reordered_io, 0, ""); > % struct nfsrvstats nfsrvstats; > % SYSCTL_STRUCT(_vfs_nfsrv, NFS_NFSRVSTATS, nfsrvstats, CTLFLAG_RW, > % @@ -145,4 +155,73 @@ > % > % /* > % + * Detect sequential access so that we are able to hint the underlying > % + * file system to use clustered I/O when appropriate. > % + */ > % +static int > % +nfsrv_sequential_access(const struct uio *uio, const struct vnode *vp) > % +{ > % + struct nfsheur *nh; > % + unsigned hi, step; > % + int try = 8; > % + int nblocks, lblocksize; > % + > % + /* > % + * Locate best nfsheur[] candidate using double hashing. > % + */ > % + > % + hi = NH_TAG(vp) % NUM_HEURISTIC; > % + step = NH_TAG(vp) & HASH_MAXSTEP; > % + step++; /* Step must not be zero. */ > % + nh = &nfsheur[hi]; I can't speak to whether using a variable step makes an appreciable difference. I have not examined that in detail in my tests. > % + > % + while (try--) { > % + if (nfsheur[hi].nh_tag == NH_TAG(vp)) { > % + nh = &nfsheur[hi]; > % + break; > % + } > % + if (nfsheur[hi].nh_use > 0) > % + --nfsheur[hi].nh_use; > % + hi = hi + step; > % + if (hi >= NUM_HEURISTIC) > % + hi -= NUM_HEURISTIC; > % + if (nfsheur[hi].nh_use < nh->nh_use) > % + nh = &nfsheur[hi]; > % + } > % + > % + if (nh->nh_tag != NH_TAG(vp)) { /* New entry. */ > % + nh->nh_tag = NH_TAG(vp); > % + nh->nh_nextoff = uio->uio_offset; > % + nh->nh_use = NHUSE_INIT; > % + nh->nh_seqcount = 1; /* Initially assume sequential access. */ > % + } else { > % + nh->nh_use += NHUSE_INC; > % + if (nh->nh_use > NHUSE_MAX) > % + nh->nh_use = NHUSE_MAX; > % + } > % + > % + /* > % + * Calculate heuristic > % + */ > % + > % + lblocksize = vp->v_mount->mnt_stat.f_iosize; > % + nblocks = howmany(uio->uio_resid, lblocksize); This is similar to what I pulled out of sequential_heuristic() except that it doesn't hardcode 16k. There is a big comment above the 16k that says it isn't about the blocksize though, so I'm not sure which is most correct. I imagine we'd want to use the same strategy in both places though. 
Comment from vfs_vnops.c: /* * f_seqcount is in units of fixed-size blocks so that it * depends mainly on the amount of sequential I/O and not * much on the number of sequential I/O's. The fixed size * of 16384 is hard-coded here since it is (not quite) just * a magic size that works well here. This size is more * closely related to the best I/O size for real disks than * to any block size used by software. */ fp->f_seqcount += howmany(uio->uio_resid, 16384); > % + if (uio->uio_offset == nh->nh_nextoff) { > % + nh->nh_seqcount += nblocks; > % + if (nh->nh_seqcount > IO_SEQMAX) > % + nh->nh_seqcount = IO_SEQMAX; > % + } else if (uio->uio_offset == 0) { > % + /* Seek to beginning of file, ignored. */ > % + } else if (qabs(uio->uio_offset - nh->nh_nextoff) <= > % + MAX_REORDERED_RPC*imax(lblocksize, uio->uio_resid)) { > % + nfsrv_reordered_io++; /* Probably reordered RPC, do nothing. */ Ah, this is a nice touch! I had noticed reordered I/O's resetting my clustered I/O count. I should try this extra step. > % + } else > % + nh->nh_seqcount /= 2; /* Not sequential access. */ Hmm, this is a bit different as well. sequential_heuristic() just drops all clustering (seqcount = 1) here so I had followed that. I do wonder if this change would be good for "normal" I/O as well? (Again, I think it would do well to have "normal" I/O and NFS generally use the same algorithm, but perhaps with the extra logic to handle reordered writes more gracefully for NFS.) > % + > % + nh->nh_nextoff = uio->uio_offset + uio->uio_resid; Interesting. So this assumes the I/O never fails. > % @@ -1225,4 +1251,5 @@ > % vn_finished_write(mntp); > % VFS_UNLOCK_GIANT(vfslocked); > % + bwillwrite(); /* After VOP_WRITE to avoid reordering. */ > % return(error); > % } Hmm, this seems to be related to avoiding overloading the NFS server's buffer cache? > % @@ -1492,4 +1519,6 @@ > % } > % if (!error) { > % + if (nfsrv_cluster_writes) > % + ioflags |= nfsrv_sequential_access(uiop, vp); > % error = VOP_WRITE(vp, uiop, ioflags, cred); > % /* XXXRW: unlocked write. */ > % @@ -1582,4 +1611,5 @@ > % } > % splx(s); > % + bwillwrite(); /* After VOP_WRITE to avoid reordering. */ > % return (0); > % } Ah, this code is no longer present in 8 (but is in 7). > % @@ -3827,5 +3857,9 @@ > % for_ret = VOP_GETATTR(vp, &bfor, cred, td); > % > % - if (cnt > MAX_COMMIT_COUNT) { > % + /* > % + * If count is 0, a flush from offset to the end of file > % + * should be performed according to RFC 1813. > % + */ > % + if (cnt == 0 || cnt > MAX_COMMIT_COUNT) { > % /* > % * Give up and do the whole thing This appears to be a seperate standalone fix. > % @@ -3871,5 +3905,5 @@ > % bo = &vp->v_bufobj; > % BO_LOCK(bo); > % - while (cnt > 0) { > % + while (!error && cnt > 0) { > % struct buf *bp; > % > % @@ -3894,5 +3928,5 @@ > % bremfree(bp); > % bp->b_flags &= ~B_ASYNC; > % - bwrite(bp); > % + error = bwrite(bp); > % ++nfs_commit_miss; I think you can just do a 'break' here without having to modify the while condition. This also seems to be an unrelated bugfix. 
> % } else > % Index: nfs_syscalls.c > % =================================================================== > % RCS file: /home/ncvs/src/sys/nfsserver/Attic/nfs_syscalls.c,v > % retrieving revision 1.119 > % diff -u -2 -r1.119 nfs_syscalls.c > % --- nfs_syscalls.c 30 Jun 2008 20:43:06 -0000 1.119 > % +++ nfs_syscalls.c 2 Jul 2008 07:12:57 -0000 > % @@ -86,5 +86,4 @@ > % int nfsd_waiting = 0; > % int nfsrv_numnfsd = 0; > % -static int notstarted = 1; > % > % static int nfs_privport = 0; > % @@ -448,7 +447,6 @@ > % procrastinate = nfsrvw_procrastinate; > % NFSD_UNLOCK(); > % - if (writes_todo || (!(nd->nd_flag & ND_NFSV3) && > % - nd->nd_procnum == NFSPROC_WRITE && > % - procrastinate > 0 && !notstarted)) > % + if (writes_todo || (nd->nd_procnum == NFSPROC_WRITE && > % + procrastinate > 0)) > % error = nfsrv_writegather(&nd, slp, > % nfsd->nfsd_td, &mreq); This no longer seems to be present in 8. > I have forgotten many details but can dig them out from large private > mails about this if you care. It looks like most of the above is about > getting the seqcount write; probably similar to what you did. -current > is still missing the error check for the bwrite() -- that's the only > bwrite() in the whole server. However, the fix isn't good -- one obvious > bug is that a later successful bwrite() destroys the evidence of a > previous bwrite() error. One thing I had done was to use a separate set of heuristics for reading vs writing. However, that is possibly dubious (and we don't do it for local I/O), so I can easily drop that feature if desired. -- John Baldwin From owner-freebsd-fs@FreeBSD.ORG Thu Aug 25 22:14:17 2011 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D9B7A1065670; Thu, 25 Aug 2011 22:14:17 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from fallbackmx09.syd.optusnet.com.au (fallbackmx09.syd.optusnet.com.au [211.29.132.242]) by mx1.freebsd.org (Postfix) with ESMTP id 3D3078FC14; Thu, 25 Aug 2011 22:14:16 +0000 (UTC) Received: from mail07.syd.optusnet.com.au (mail07.syd.optusnet.com.au [211.29.132.188]) by fallbackmx09.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id p7PJOka7019527; Fri, 26 Aug 2011 05:24:46 +1000 Received: from c122-106-165-191.carlnfd1.nsw.optusnet.com.au (c122-106-165-191.carlnfd1.nsw.optusnet.com.au [122.106.165.191]) by mail07.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id p7PJOb1x016474 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 26 Aug 2011 05:24:42 +1000 Date: Fri, 26 Aug 2011 05:24:15 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: John Baldwin In-Reply-To: <201108251347.45460.jhb@freebsd.org> Message-ID: <20110826043611.D2962@besplex.bde.org> References: <201108251347.45460.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Rick Macklem , fs@freebsd.org Subject: Re: Fixes to allow write clustering of NFS writes from a FreeBSD NFS client X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 25 Aug 2011 22:14:17 -0000 On Thu, 25 Aug 2011, John Baldwin wrote: > I was doing some analysis of compiles over NFS at work recently and noticed > from 'iostat 1' on the NFS server that all my NFS writes were always 16k > writes (meaning that writes were never being clustered). 
> I added some

Did you see the old patches for this by Bjorn Gronwall?  They went through many iterations.  He was mainly interested in the !async case and I was mainly interested in the async case...

> debugging sysctls to the NFS client and server code as well as the FFS write
> VOP to figure out the various kind of write requests that were being sent.  I
> found that during the NFS compile, the NFS client was sending a lot of
> FILESYNC writes even though nothing in the compile process uses fsync().
> Based on the debugging I added, I found that all of the FILESYNC writes were
> marked as such because the buffer in question did not have B_ASYNC set:
>
> 	if ((bp->b_flags & (B_ASYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER)) == B_ASYNC)
> 		iomode = NFSV3WRITE_UNSTABLE;
> 	else
> 		iomode = NFSV3WRITE_FILESYNC;

The async-mounted case (that's async on the server; async on the client bogusly succeeds and does nothing or less) should make the writes async in all cases, and thus make the behaviour independent of the client's sync requests (except hopefully fsync(2) on the client works -- it is mostly unbroken in my ffs servers), and almost independent of the client's write sizes (they should get grouped into large clusters after they have had time to accumulate), and almost independent of the server's write clustering at the nfs level (the vfs clustering level should do it).  However, there was some problem with async too.

> I eventually tracked this down to the code in the NFS client that pushes out a
> previous dirty region via 'bwrite()' when a write would dirty a non-contiguous
> region in the buffer:
>
> 	if (bp->b_dirtyend > 0 &&
> 	    (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) {
> 		if (bwrite(bp) == EINTR) {
> 			error = EINTR;
> 			break;
> 		}
> 		goto again;
> 	}
>
> (These writes are triggered during the compile of a file by the assembler
> seeking back into the file it has already written out to apply various
> fixups.)

I mainly tested huge (~1GB) sequential writes, which is exactly the opposite of this case.

> From this I concluded that the test above is flawed.  We should be using
> UNSTABLE writes for the writes above as the user has not requested them to
> be synchronous.  The issue (I believe) is that the NFS client is overloading
> the B_ASYNC flag.  The B_ASYNC flag means that the caller of bwrite()
> (or rather bawrite()) is not synchronously blocking to see if the request
> has completed.  Instead, it is a "fire and forget".  This is not the same
> thing as the IO_SYNC flag passed in ioflags during a write request which
> requests fsync()-like behavior.  To disambiguate the two I added a new
> B_SYNC flag and changed the NFS clients to set this for write requests
> with IO_SYNC set.  I then updated the condition above to instead check for
> B_SYNC being set rather than checking for B_ASYNC being clear.

Seems reasonable.

> That converted all the FILESYNC write RPCs from my builds into UNSTABLE
> write RPCs.  The patch for that is at
> http://www.FreeBSD.org/~jhb/patches/nfsclient_sync_writes.patch.
>
> However, even with this change I was still not getting clustered writes on
> the NFS server (all writes were still 16k).  After digging around in the
> code for a bit I found that ffs will only cluster writes if the passed in
> 'ioflags' to ffs_write() specify a sequential hint.  I then noticed that
> the NFS server has code to keep track of sequential I/O heuristics for
> reads, but not writes.  I took the code from the NFS server's read op

Bjorn's patches were mostly for this.
We also tried changing the write gathering code and/or sysctls to control it. > and moved it into a function to compute a sequential I/O heuristic that > could be shared by both reads and writes. I also updated the sequential > heuristic code to advance the counter based on the number of 16k blocks > in each write instead of just doing ++ to match what we do for local > file writes in sequential_heuristic() in vfs_vnops.c. Using this did > give me some measure of NFS write clustering (though I can't peg my > disks at MAXPHYS the way a dd to a file on a local filesystem can). The I got close to it. The failure modes were mostly burstiness of i/o, where the server buffer cache seemed to fill up so the client would stop sending and stay stopped for too long (several seconds; enough to reduce the throughput by 40-60%). > patch for these changes is at > http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch > > (This also fixes a bug in the new NFS server in that it wasn't actually > clustering reads since it never updated nh->nh_nextr.) > > Combining the two changes together gave me about a 1% reduction in wall > time for my builds: > > +------------------------------------------------------------------------------+ > |+ + ++ + +x++*x xx+x x x| > | |___________A__|_M_______|_A____________| | > +------------------------------------------------------------------------------+ > N Min Max Median Avg Stddev > x 10 1869.62 1943.11 1881.89 1886.12 21.549724 > + 10 1809.71 1886.53 1869.26 1860.706 21.530664 > Difference at 95.0% confidence > -25.414 +/- 20.2391 > -1.34742% +/- 1.07305% > (Student's t, pooled s = 21.5402) > > One caveat: I tested both of these patches on the old NFS client and server > on 8.2-stable. I then ported the changes to the new client and server and > while I made sure they compiled, I have not tested the new client and server. Here is the version of Bjorn's patches that I last used (in 8-current in 2008): % Index: nfs_serv.c % =================================================================== % RCS file: /home/ncvs/src/sys/nfsserver/nfs_serv.c,v % retrieving revision 1.182 % diff -u -2 -r1.182 nfs_serv.c % --- nfs_serv.c 28 May 2008 16:23:17 -0000 1.182 % +++ nfs_serv.c 1 Jun 2008 05:52:45 -0000 % @@ -107,14 +107,18 @@ % #define MAX_COMMIT_COUNT (1024 * 1024) % % -#define NUM_HEURISTIC 1017 % +#define NUM_HEURISTIC 1031 /* Must be prime! 
*/ % +#define MAX_REORDERED_RPC 4 % +#define HASH_MAXSTEP 0x3ff % #define NHUSE_INIT 64 % #define NHUSE_INC 16 % #define NHUSE_MAX 2048 % +#define NH_TAG(vp) ((uint32_t)((uintptr_t)vp / sizeof(struct vnode))) % +CTASSERT(NUM_HEURISTIC > (HASH_MAXSTEP + 1)); % % static struct nfsheur { % - struct vnode *nh_vp; /* vp to match (unreferenced pointer) */ % - off_t nh_nextr; /* next offset for sequential detection */ % - int nh_use; /* use count for selection */ % - int nh_seqcount; /* heuristic */ % + off_t nh_nextoff; /* next offset for sequential detection */ % + uint32_t nh_tag; /* vp tag to match */ % + uint16_t nh_use; /* use count for selection */ % + uint16_t nh_seqcount; /* in units of logical blocks */ % } nfsheur[NUM_HEURISTIC]; % % @@ -131,8 +135,14 @@ % static int nfs_commit_blks; % static int nfs_commit_miss; % +static int nfsrv_cluster_writes = 1; % +static int nfsrv_cluster_reads = 1; % +static int nfsrv_reordered_io; % SYSCTL_INT(_vfs_nfsrv, OID_AUTO, async, CTLFLAG_RW, &nfs_async, 0, ""); % SYSCTL_INT(_vfs_nfsrv, OID_AUTO, commit_blks, CTLFLAG_RW, &nfs_commit_blks, 0, ""); % SYSCTL_INT(_vfs_nfsrv, OID_AUTO, commit_miss, CTLFLAG_RW, &nfs_commit_miss, 0, ""); % % +SYSCTL_INT(_vfs_nfsrv, OID_AUTO, cluster_writes, CTLFLAG_RW, &nfsrv_cluster_writes, 0, ""); % +SYSCTL_INT(_vfs_nfsrv, OID_AUTO, cluster_reads, CTLFLAG_RW, &nfsrv_cluster_reads, 0, ""); % +SYSCTL_INT(_vfs_nfsrv, OID_AUTO, reordered_io, CTLFLAG_RW, &nfsrv_reordered_io, 0, ""); % struct nfsrvstats nfsrvstats; % SYSCTL_STRUCT(_vfs_nfsrv, NFS_NFSRVSTATS, nfsrvstats, CTLFLAG_RW, % @@ -145,4 +155,73 @@ % % /* % + * Detect sequential access so that we are able to hint the underlying % + * file system to use clustered I/O when appropriate. % + */ % +static int % +nfsrv_sequential_access(const struct uio *uio, const struct vnode *vp) % +{ % + struct nfsheur *nh; % + unsigned hi, step; % + int try = 8; % + int nblocks, lblocksize; % + % + /* % + * Locate best nfsheur[] candidate using double hashing. % + */ % + % + hi = NH_TAG(vp) % NUM_HEURISTIC; % + step = NH_TAG(vp) & HASH_MAXSTEP; % + step++; /* Step must not be zero. */ % + nh = &nfsheur[hi]; % + % + while (try--) { % + if (nfsheur[hi].nh_tag == NH_TAG(vp)) { % + nh = &nfsheur[hi]; % + break; % + } % + if (nfsheur[hi].nh_use > 0) % + --nfsheur[hi].nh_use; % + hi = hi + step; % + if (hi >= NUM_HEURISTIC) % + hi -= NUM_HEURISTIC; % + if (nfsheur[hi].nh_use < nh->nh_use) % + nh = &nfsheur[hi]; % + } % + % + if (nh->nh_tag != NH_TAG(vp)) { /* New entry. */ % + nh->nh_tag = NH_TAG(vp); % + nh->nh_nextoff = uio->uio_offset; % + nh->nh_use = NHUSE_INIT; % + nh->nh_seqcount = 1; /* Initially assume sequential access. */ % + } else { % + nh->nh_use += NHUSE_INC; % + if (nh->nh_use > NHUSE_MAX) % + nh->nh_use = NHUSE_MAX; % + } % + % + /* % + * Calculate heuristic % + */ % + % + lblocksize = vp->v_mount->mnt_stat.f_iosize; % + nblocks = howmany(uio->uio_resid, lblocksize); % + if (uio->uio_offset == nh->nh_nextoff) { % + nh->nh_seqcount += nblocks; % + if (nh->nh_seqcount > IO_SEQMAX) % + nh->nh_seqcount = IO_SEQMAX; % + } else if (uio->uio_offset == 0) { % + /* Seek to beginning of file, ignored. */ % + } else if (qabs(uio->uio_offset - nh->nh_nextoff) <= % + MAX_REORDERED_RPC*imax(lblocksize, uio->uio_resid)) { % + nfsrv_reordered_io++; /* Probably reordered RPC, do nothing. */ % + } else % + nh->nh_seqcount /= 2; /* Not sequential access. 
*/ % + % + nh->nh_nextoff = uio->uio_offset + uio->uio_resid; % + % + return (nh->nh_seqcount << IO_SEQSHIFT); % +} % + % +/* % * Clear nameidata fields that are tested in nsfmout cleanup code prior % * to using first nfsm macro (that might jump to the cleanup code). % @@ -785,5 +864,4 @@ % struct uio io, *uiop = &io; % struct vattr va, *vap = &va; % - struct nfsheur *nh; % off_t off; % int ioflag = 0; % @@ -857,59 +935,4 @@ % cnt = reqlen; % % - /* % - * Calculate seqcount for heuristic % - */ % - % - { % - int hi; % - int try = 32; % - % - /* % - * Locate best candidate % - */ % - % - hi = ((int)(vm_offset_t)vp / sizeof(struct vnode)) % NUM_HEURISTIC; % - nh = &nfsheur[hi]; % - % - while (try--) { % - if (nfsheur[hi].nh_vp == vp) { % - nh = &nfsheur[hi]; % - break; % - } % - if (nfsheur[hi].nh_use > 0) % - --nfsheur[hi].nh_use; % - hi = (hi + 1) % NUM_HEURISTIC; % - if (nfsheur[hi].nh_use < nh->nh_use) % - nh = &nfsheur[hi]; % - } % - % - if (nh->nh_vp != vp) { % - nh->nh_vp = vp; % - nh->nh_nextr = off; % - nh->nh_use = NHUSE_INIT; % - if (off == 0) % - nh->nh_seqcount = 4; % - else % - nh->nh_seqcount = 1; % - } % - % - /* % - * Calculate heuristic % - */ % - % - if ((off == 0 && nh->nh_seqcount > 0) || off == nh->nh_nextr) { % - if (++nh->nh_seqcount > IO_SEQMAX) % - nh->nh_seqcount = IO_SEQMAX; % - } else if (nh->nh_seqcount > 1) { % - nh->nh_seqcount = 1; % - } else { % - nh->nh_seqcount = 0; % - } % - nh->nh_use += NHUSE_INC; % - if (nh->nh_use > NHUSE_MAX) % - nh->nh_use = NHUSE_MAX; % - ioflag |= nh->nh_seqcount << IO_SEQSHIFT; % - } % - % nfsm_reply(NFSX_POSTOPORFATTR(v3) + 3 * NFSX_UNSIGNED+nfsm_rndup(cnt)); % if (v3) { % @@ -969,7 +992,8 @@ % uiop->uio_rw = UIO_READ; % uiop->uio_segflg = UIO_SYSSPACE; % + if (nfsrv_cluster_reads) % + ioflag |= nfsrv_sequential_access(uiop, vp); % error = VOP_READ(vp, uiop, IO_NODELOCKED | ioflag, cred); % off = uiop->uio_offset; % - nh->nh_nextr = off; % FREE((caddr_t)iv2, M_TEMP); % if (error || (getret = VOP_GETATTR(vp, vap, cred, td))) { % @@ -1177,4 +1201,6 @@ % uiop->uio_td = NULL; % uiop->uio_offset = off; % + if (nfsrv_cluster_writes) % + ioflags |= nfsrv_sequential_access(uiop, vp); % error = VOP_WRITE(vp, uiop, ioflags, cred); % /* XXXRW: unlocked write. */ % @@ -1225,4 +1251,5 @@ % vn_finished_write(mntp); % VFS_UNLOCK_GIANT(vfslocked); % + bwillwrite(); /* After VOP_WRITE to avoid reordering. */ % return(error); % } % @@ -1492,4 +1519,6 @@ % } % if (!error) { % + if (nfsrv_cluster_writes) % + ioflags |= nfsrv_sequential_access(uiop, vp); % error = VOP_WRITE(vp, uiop, ioflags, cred); % /* XXXRW: unlocked write. */ % @@ -1582,4 +1611,5 @@ % } % splx(s); % + bwillwrite(); /* After VOP_WRITE to avoid reordering. */ % return (0); % } % @@ -3827,5 +3857,9 @@ % for_ret = VOP_GETATTR(vp, &bfor, cred, td); % % - if (cnt > MAX_COMMIT_COUNT) { % + /* % + * If count is 0, a flush from offset to the end of file % + * should be performed according to RFC 1813. 
% + */ % + if (cnt == 0 || cnt > MAX_COMMIT_COUNT) { % /* % * Give up and do the whole thing % @@ -3871,5 +3905,5 @@ % bo = &vp->v_bufobj; % BO_LOCK(bo); % - while (cnt > 0) { % + while (!error && cnt > 0) { % struct buf *bp; % % @@ -3894,5 +3928,5 @@ % bremfree(bp); % bp->b_flags &= ~B_ASYNC; % - bwrite(bp); % + error = bwrite(bp); % ++nfs_commit_miss; % } else % Index: nfs_syscalls.c % =================================================================== % RCS file: /home/ncvs/src/sys/nfsserver/Attic/nfs_syscalls.c,v % retrieving revision 1.119 % diff -u -2 -r1.119 nfs_syscalls.c % --- nfs_syscalls.c 30 Jun 2008 20:43:06 -0000 1.119 % +++ nfs_syscalls.c 2 Jul 2008 07:12:57 -0000 % @@ -86,5 +86,4 @@ % int nfsd_waiting = 0; % int nfsrv_numnfsd = 0; % -static int notstarted = 1; % % static int nfs_privport = 0; % @@ -448,7 +447,6 @@ % procrastinate = nfsrvw_procrastinate; % NFSD_UNLOCK(); % - if (writes_todo || (!(nd->nd_flag & ND_NFSV3) && % - nd->nd_procnum == NFSPROC_WRITE && % - procrastinate > 0 && !notstarted)) % + if (writes_todo || (nd->nd_procnum == NFSPROC_WRITE && % + procrastinate > 0)) % error = nfsrv_writegather(&nd, slp, % nfsd->nfsd_td, &mreq); I have forgotten many details but can dig them out from large private mails about this if you care. It looks like most of the above is about getting the seqcount write; probably similar to what you did. -current is still missing the error check for the bwrite() -- that's the only bwrite() in the whole server. However, the fix isn't good -- one obvious bug is that a later successful bwrite() destroys the evidence of a previous bwrite() error. Bruce From owner-freebsd-fs@FreeBSD.ORG Thu Aug 25 23:03:05 2011 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BFDBE1065766; Thu, 25 Aug 2011 23:03:05 +0000 (UTC) (envelope-from bfriesen@simple.dallas.tx.us) Received: from blade.simplesystems.org (blade.simplesystems.org [65.66.246.74]) by mx1.freebsd.org (Postfix) with ESMTP id 67F8A8FC1F; Thu, 25 Aug 2011 23:03:05 +0000 (UTC) Received: from freddy.simplesystems.org (freddy.simplesystems.org [65.66.246.65]) by blade.simplesystems.org (8.14.4+Sun/8.14.4) with ESMTP id p7PMpFtX007607; Thu, 25 Aug 2011 17:51:15 -0500 (CDT) Date: Thu, 25 Aug 2011 17:51:15 -0500 (CDT) From: Bob Friesenhahn X-X-Sender: bfriesen@freddy.simplesystems.org To: John Baldwin In-Reply-To: <201108251347.45460.jhb@freebsd.org> Message-ID: References: <201108251347.45460.jhb@freebsd.org> User-Agent: Alpine 2.01 (GSO 1266 2009-07-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.2 (blade.simplesystems.org [65.66.246.90]); Thu, 25 Aug 2011 17:51:15 -0500 (CDT) Cc: Rick Macklem , fs@freebsd.org Subject: Re: Fixes to allow write clustering of NFS writes from a FreeBSD NFS client X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 25 Aug 2011 23:03:05 -0000 On Thu, 25 Aug 2011, John Baldwin wrote: > I was doing some analysis of compiles over NFS at work recently and noticed > from 'iostat 1' on the NFS server that all my NFS writes were always 16k > writes (meaning that writes were never being clustered). 
I added some > debugging sysctls to the NFS client and server code as well as the FFS write > VOP to figure out the various kind of write requests that were being sent. I > found that during the NFS compile, the NFS client was sending a lot of > FILESYNC writes even though nothing in the compile process uses fsync(). A fundamental principle of NFS is that writes are synchronous so that if the server spontaneously reboots, all the acknowledged writes will still be present on disk and the client just continues (after a delay) without loss/corruption of data. NFSv3 added the ability to send uncommitted data to the server, with the agreement that the client would agree to re-send any uncommitted data if the server spontaneously rebooted. Most clients are not responsibly prepared to participate in this since it would require some non-volatile local storage on the client. I don't know if your changes would harm these expectations. Regardless, there is little doubt that the default client NFS in FreeBSD 8.2 suffers quite a lot in sequential write performance as compared with an OS like Solaris. Hopefully the new NFS that Rick Macklem has been working on (and is apparently ready for general use) will perform much better. Since FreeBSD is switching to the new implementation it seems like that is where the efforts should be going. Bob -- Bob Friesenhahn bfriesen@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ From owner-freebsd-fs@FreeBSD.ORG Fri Aug 26 01:16:18 2011 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D920F106564A; Fri, 26 Aug 2011 01:16:18 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-annu.mail.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 6E16D8FC08; Fri, 26 Aug 2011 01:16:18 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ap4EADfzVk6DaFvO/2dsb2JhbABDhEykPIFAAQEEASNWBRYOCgICDRkCWQYTCYdpBKhAkV+BLIQPgREEkxmRFw X-IronPort-AV: E=Sophos;i="4.68,283,1312171200"; d="scan'208";a="132288631" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-annu-pri.mail.uoguelph.ca with ESMTP; 25 Aug 2011 21:16:17 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 9DD1FB3F2D; Thu, 25 Aug 2011 21:16:17 -0400 (EDT) Date: Thu, 25 Aug 2011 21:16:17 -0400 (EDT) From: Rick Macklem To: Bob Friesenhahn Message-ID: <742759548.371106.1314321377626.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.202] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692) Cc: Rick Macklem , fs@freebsd.org Subject: Re: Fixes to allow write clustering of NFS writes from a FreeBSD NFS client X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 26 Aug 2011 01:16:19 -0000 Bob Friesenhahn wrote: > On Thu, 25 Aug 2011, John Baldwin wrote: > > > I was doing some analysis of compiles over NFS at work recently and > > noticed > > from 'iostat 1' on the NFS server that all my NFS writes were always > > 16k > > writes (meaning that writes were never being 
clustered). I added > > some > > debugging sysctls to the NFS client and server code as well as the > > FFS write > > VOP to figure out the various kind of write requests that were being > > sent. I > > found that during the NFS compile, the NFS client was sending a lot > > of > > FILESYNC writes even though nothing in the compile process uses > > fsync(). > > A fundamental principle of NFS is that writes are synchronous so that > if the server spontaneously reboots, all the acknowledged writes will > still be present on disk and the client just continues (after a delay) > without loss/corruption of data. NFSv3 added the ability to send > uncommitted data to the server, with the agreement that the client > would agree to re-send any uncommitted data if the server > spontaneously rebooted. Most clients are not responsibly prepared to > participate in this since it would require some non-volatile local > storage on the client. > Although I wouldn't want to say it's bug free, I believe that the FreeBSD NFS client code (the new client clones the old one in this regard) does handle UNSTABLE (data that will be committed later or re-written if the server reboots before the Commit RPC completes). I have tested this a little and it seemed to work, including doing the write RPCs again, if the server was rebooted before the Commit RPC completed. I think that the tradition of asynchronous writes (where the RPC is started right away) needs to be largely replaced by delayed writes (just mark the block dirty and write it back sometime later). The trick here is to avoid flooding the buffer cache or generating large bursts of write RPCs by doing the write backs at an appropriate rate and using the largest size possible. (NFSv3,4 servers specify the largest write RPC size they can handle. As I noted in the other post, this is 1Mbyte for Solaris10 and I'd like to see the FreeBSD server doing the same, but it's currently only MAX_BSIZE == 64K.) > I don't know if your changes would harm these expectations. > > Regardless, there is little doubt that the default client NFS in > FreeBSD 8.2 suffers quite a lot in sequential write performance as > compared with an OS like Solaris. Hopefully the new NFS that Rick > Macklem has been working on (and is apparently ready for general use) > will perform much better. Since FreeBSD is switching to the new > implementation it seems like that is where the efforts should be > going. > Well, the two clients are clones w.r.t. the buffer cache stuff at this point. I did that because: 1 - I don't understand the buffer cache code well enough to modify it without breaking it. 2 - I wanted the 2 clients to be "bug compatible" during the switchover of defaults. Given this, the performance will be about the same at this point. However, getting the clients to do less synchronous writing (both w.r.t. doing them right away and w.r.t. setting FILESYNC instead of UNSTABLE) and fewer big write RPCs could be worth the effort, I think? One thing I do have in the "futures" list (I should have a patch that can be tested by others out soon) does client side on-disk caching, but only for the specific case where the client holds an NFSv4 delegation for the file. (I call this Packrats, so when you see a posting about a Packrats patch, this is what it is and if you can try it, please do so. 
You might like how it performs.:-)

rick

> Bob
> --
> Bob Friesenhahn
> bfriesen@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

From owner-freebsd-fs@FreeBSD.ORG Fri Aug 26 01:26:34 2011 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id DCA051065691 for ; Fri, 26 Aug 2011 01:26:34 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id 8154A8FC0C for ; Fri, 26 Aug 2011 01:26:34 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ap8EAI/uVk6DaFvO/2dsb2JhbAA7CBaENqQ8gUABAQQBIwRSBRYOCgICDRkCWQYch2kEqEiRXYEsgX+CEIERBJMZkRc X-IronPort-AV: E=Sophos;i="4.68,283,1312171200"; d="scan'208";a="135592738" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-jnhn-pri.mail.uoguelph.ca with ESMTP; 25 Aug 2011 20:57:38 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 286FBB3F06; Thu, 25 Aug 2011 20:57:38 -0400 (EDT) Date: Thu, 25 Aug 2011 20:57:38 -0400 (EDT) From: Rick Macklem To: John Baldwin Message-ID: <354219004.370807.1314320258153.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <201108251347.45460.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.202] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692) Cc: Rick Macklem , fs@freebsd.org Subject: Re: Fixes to allow write clustering of NFS writes from a FreeBSD NFS client X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 26 Aug 2011 01:26:34 -0000

John Baldwin wrote:
> I was doing some analysis of compiles over NFS at work recently and noticed
> from 'iostat 1' on the NFS server that all my NFS writes were always 16k
> writes (meaning that writes were never being clustered).  I added some
> debugging sysctls to the NFS client and server code as well as the FFS write
> VOP to figure out the various kind of write requests that were being sent.  I
> found that during the NFS compile, the NFS client was sending a lot of
> FILESYNC writes even though nothing in the compile process uses fsync().
> Based on the debugging I added, I found that all of the FILESYNC writes were
> marked as such because the buffer in question did not have B_ASYNC set:
>
> 	if ((bp->b_flags & (B_ASYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER)) == B_ASYNC)
> 		iomode = NFSV3WRITE_UNSTABLE;
> 	else
> 		iomode = NFSV3WRITE_FILESYNC;
>
> I eventually tracked this down to the code in the NFS client that pushes out a
> previous dirty region via 'bwrite()' when a write would dirty a non-contiguous
> region in the buffer:
>
> 	if (bp->b_dirtyend > 0 &&
> 	    (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) {
> 		if (bwrite(bp) == EINTR) {
> 			error = EINTR;
> 			break;
> 		}
> 		goto again;
> 	}
>

Well I'm not sure this bwrite() is required in the current buffer cache implementation.  I'm far from understanding the FreeBSD NFS buffer cache code (I just cloned it for the new NFS client), but it seems to pre-read in the block before writing it out, unless the write is of a full block.
If this is the case, doing multiple non-contiguous writes to the buffer should be fine, I think?

Way back in the late 1980s I implemented NFS buffer cache code that worked this way for writing:
- dirty a region of the buffer
- if another write that was contiguous with this region occurred, grow the dirty region
- else if a non-contiguous write occurred, write the dirty region out synchronously, and then write the new dirty region
- else if a read that was within the dirty region occurred, copy the data out of the buffer
- else if a read that wasn't within the dirty region occurred,
  - write the dirty region out
  - read in the entire block
  - copy the data from the block

Somewhere along the way, I believe this was converted to the Sun style (which is more like a local file system) where:
- if a write is of a partial block
  - read the block in
  - modify the region of the block
  - mark the block dirty and start an asynchronous write of the block

With this model, all the data in the block is up-to-date, so it doesn't really matter how many non-contiguous region(s) you modify, if you write the entire block after the modification(s).  (For simplicity, I haven't worried about the EOF.  Either model has to keep track of the file's size "np->n_size" and deal with where EOF is.)

I have a "hunch" that the above code snippet is left over from my old model.  (Which worked quite well for the old Portable C compiler and linker of the 4BSD Vax era.)  I don't think any write is necessary at that place any more.  (I also don't think it should need to keep track of b_dirtyoff, b_dirtyend.)

Another thing that would be nice is to allow reads from the block while it is being written back.  (Writes either need to be blocked or handled so that the block is still known to be dirty and needs to be written again, but reading the data during the write back should be fine.)  I don't know, but I don't think that can happen now, given the way the buffers are locked during writing.

> (These writes are triggered during the compile of a file by the assembler
> seeking back into the file it has already written out to apply various
> fixups.)
>
> From this I concluded that the test above is flawed.  We should be using
> UNSTABLE writes for the writes above as the user has not requested them to
> be synchronous.

It was synchronous (in the sense that it would wait until the write RPC was completed) in the old model, since the new dirty region replaced the old one.  (As noted above, I don't think a bwrite() of any kind is needed there now.)

> The issue (I believe) is that the NFS client is overloading
> the B_ASYNC flag.  The B_ASYNC flag means that the caller of bwrite()
> (or rather bawrite()) is not synchronously blocking to see if the request
> has completed.  Instead, it is a "fire and forget".  This is not the same
> thing as the IO_SYNC flag passed in ioflags during a write request which
> requests fsync()-like behavior.  To disambiguate the two I added a new
> B_SYNC flag and changed the NFS clients to set this for write requests
> with IO_SYNC set.  I then updated the condition above to instead check for
> B_SYNC being set rather than checking for B_ASYNC being clear.

Yes, in the bad old days B_ASYNC meant start the write RPC but don't wait for it to complete.  I think what you have said is correct above, in that setting FILESYNC should be based on the IO_SYNC flag only.

> That converted all the FILESYNC write RPCs from my builds into UNSTABLE
> write RPCs.
The patch for that is at > http://www.FreeBSD.org/~jhb/patches/nfsclient_sync_writes.patch. > > However, even with this change I was still not getting clustered > writes on > the NFS server (all writes were still 16k). After digging around in > the > code for a bit I found that ffs will only cluster writes if the passed > in > 'ioflags' to ffs_write() specify a sequential hint. I then noticed > that > the NFS server has code to keep track of sequential I/O heuristics for > reads, but not writes. I took the code from the NFS server's read op > and moved it into a function to compute a sequential I/O heuristic > that > could be shared by both reads and writes. I also updated the > sequential > heuristic code to advance the counter based on the number of 16k > blocks > in each write instead of just doing ++ to match what we do for local > file writes in sequential_heuristic() in vfs_vnops.c. Using this did > give me some measure of NFS write clustering (though I can't peg my > disks at MAXPHYS the way a dd to a file on a local filesystem can). > The > patch for these changes is at > http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch > The above says you understand this stuff and I don't. However, I will note that the asynchronous case, which starts the write RPC now, makes clustering difficult and limits what you can do. (I think it was done in the bad old days to avoid flooding the buffer cache and then having things pushing writes back to get buffers. These days the buffer cache can be much bigger and it's easy to create kernel threads to do write backs at appropriate times. As such, I'd lean away from asynchronous (as in start the write now) and towards delayed writes. If the writes are delayed "bdwrite()" then I think it is much easier to find contiguous dirty buffers to do as one write RPC. However, if you just do bdwrite()s, there tends to be big bursts of write RPCs when the syncer does its thing, unless kernel threads are working through the cache doing write backs. Since there are nfsiod threads, maybe these could scan for contiguous dirty buffers and start big write RPCs for them? If there was some time limit set for how long the buffer sits dirty before it gets a write started for it, that would avoid a burst caused by the syncer. Also, if you are lucky w.r.t. doing delayed writes for temporary files, the file gets deleted before the write-back. > (This also fixes a bug in the new NFS server in that it wasn't > actually > clustering reads since it never updated nh->nh_nextr.) > Thanks, I have no idea what this is... > Combining the two changes together gave me about a 1% reduction in > wall > time for my builds: > > +------------------------------------------------------------------------------+ > |+ + ++ + +x++*x xx+x x x| > | |___________A__|_M_______|_A____________| | > +------------------------------------------------------------------------------+ > N Min Max Median Avg Stddev > x 10 1869.62 1943.11 1881.89 1886.12 21.549724 > + 10 1809.71 1886.53 1869.26 1860.706 21.530664 > Difference at 95.0% confidence > -25.414 +/- 20.2391 > -1.34742% +/- 1.07305% > (Student's t, pooled s = 21.5402) > > One caveat: I tested both of these patches on the old NFS client and > server > on 8.2-stable. I then ported the changes to the new client and server > and > while I made sure they compiled, I have not tested the new client and > server. > > -- > John Baldwin Good stuff John. Sounds like you're having fun with it. 
I would like to see it clustering using delayed writes and then doing the biggest write RPCs the server will allow. (Solaris10 allows 1Mbyte write RPCs. At this point, FreeBSD is limited to MAX_BSIZE, which is only 64K but hopefully can be increased?) rick From owner-freebsd-fs@FreeBSD.ORG Fri Aug 26 12:19:57 2011 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 734A0106566B; Fri, 26 Aug 2011 12:19:57 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id E75098FC19; Fri, 26 Aug 2011 12:19:56 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ap8EAEePV06DaFvO/2dsb2JhbAA6CBaENqRDgUABAQQBIwRSBRYOCgICDRkCWQYch2kEqUiRX4EsgX+CEIERBJMakRw X-IronPort-AV: E=Sophos;i="4.68,285,1312171200"; d="scan'208";a="135626406" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-jnhn-pri.mail.uoguelph.ca with ESMTP; 26 Aug 2011 08:19:53 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 2C41AB3F9F; Fri, 26 Aug 2011 08:19:53 -0400 (EDT) Date: Fri, 26 Aug 2011 08:19:53 -0400 (EDT) From: Rick Macklem To: John Baldwin Message-ID: <2146752681.378273.1314361193159.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <201108251347.45460.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.202] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692) Cc: Rick Macklem , fs@freebsd.org Subject: Re: Fixes to allow write clustering of NFS writes from a FreeBSD NFS client X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 26 Aug 2011 12:19:57 -0000 John Baldwin wrote: > I was doing some analysis of compiles over NFS at work recently and > noticed > from 'iostat 1' on the NFS server that all my NFS writes were always > 16k > writes (meaning that writes were never being clustered). I added some > debugging sysctls to the NFS client and server code as well as the FFS > write > VOP to figure out the various kind of write requests that were being > sent. I > found that during the NFS compile, the NFS client was sending a lot of > FILESYNC writes even though nothing in the compile process uses > fsync(). > Based on the debugging I added, I found that all of the FILESYNC > writes were > marked as such because the buffer in question did not have B_ASYNC > set: > > > if ((bp->b_flags & (B_ASYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER)) > == B_ASYNC) > iomode = NFSV3WRITE_UNSTABLE; > else > iomode = NFSV3WRITE_FILESYNC; > > I eventually tracked this down to the code in the NFS client that > pushes out a > previous dirty region via 'bwrite()' when a write would dirty a > non-contiguous > region in the buffer: > > if (bp->b_dirtyend > 0 && > (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) { > if (bwrite(bp) == EINTR) { > error = EINTR; > break; > } > goto again; > } > Btw, the code was correct to use FILESYNC for this case. Why? 
Well, if the b_dirtyoff, b_dirtyend are used by the "bottom half" for the write/commit RPCs, the client won't know to re-write/commit the range specified by b_dirtyoff/b_dirtyend after the range changes.  (ie. If the server crashes/reboots between the UNSTABLE write and the commit, the change will get lost.)

However, if you calculate the off, len arguments for the Commit RPC to cover the entire block and not just b_dirtyoff->b_dirtyend, then doing the write UNSTABLE should be fine.  (Having the range larger than what was written should be ok.  In fact the FreeBSD server ignores the arguments and commits the entire file via VOP_FSYNC().)

There is this comment before the bwrite(), which I don't quite understand:

	/*
	 * If the new write will leave a contiguous dirty
	 * area, just update the b_dirtyoff and b_dirtyend,
	 * otherwise force a write rpc of the old dirty area.
	 *
	 * While it is possible to merge discontiguous writes due to
	 * our having a B_CACHE buffer ( and thus valid read data
	 * for the hole), we don't because it could lead to
	 * significant cache coherency problems with multiple clients,
	 * especially if locking is implemented later on.
	 *

I'm not sure what the author was referring to here, but the only case I can think of is: If two clients are concurrently writing to the same file and have non-overlapping byte ranges within the same block write locked, they will assume that the other client won't write data into the block they have locked.  For this case, writing only the specific range b_dirtyoff->b_dirtyend instead of the entire block would matter.  For this case, the rest of the comment does seem to apply.  (Btw, for this to work correctly, the client must write any dirty regions back to the server + re-read the block from the server whenever any byte within the block is write locked or it won't see the changes done by the other client.  I'm not sure if the FreeBSD client does this?)

	 * as an optimization we could theoretically maintain
	 * a linked list of discontinuous areas, but we would still
	 * have to commit them separately so there isn't much
	 * advantage to it except perhaps a bit of asynchronization.
	 */

I think keeping a list of discontinuous regions would be useful.  It would allow the bwrite() below this comment to be deleted and, if the client is lucky, the discontinuous regions would get filled in by another write and become an entire block before a write-back was needed.

I think the correct first step is to finish off what you've already done by changing the range calculation for the Commit RPC so that it uses the entire block and not just b_dirtyoff->b_dirtyend.  That way it is safe to do this write UNSTABLE.

Then, I think it is worth doing a list of dirty byte ranges and hope that they fill in to a full block, which is the only time the block can be clustered.  (Actually a byte range that starts at the beginning of the block can be clustered with previous blocks and a byte range that goes to the end of the block can be clustered with a subsequent block.)

And, as I meant to say before (but I'm not sure if it was clear), I don't think the clustering will work until the "start the write immediately" is changed to "mark the block dirty and write it sometime soon".

> (These writes are triggered during the compile of a file by the assembler
> seeking back into the file it has already written out to apply various
> fixups.)
>
> From this I concluded that the test above is flawed.  We should be using
> UNSTABLE writes for the writes above as the user has not requested them to
> be synchronous.
The issue (I believe) is that the NFS client is > overloading > the B_ASYNC flag. The B_ASYNC flag means that the caller of bwrite() > (or rather bawrite()) is not synchronously blocking to see if the > request > has completed. Instead, it is a "fire and forget". This is not the > same > thing as the IO_SYNC flag passed in ioflags during a write request > which > requests fsync()-like behavior. It happens that, for this particular case, you need to do both a bwrite() that waits for completion and FILESYNC. (Maybe that was why the author had the code the way it is?) However, as noted above, the change to UNSTABLE only requires that the range calculation for commit cover the entire block instead of b_dirtyoff->b_dirtyend. > To disambiguate the two I added a new > B_SYNC flag and changed the NFS clients to set this for write requests > with IO_SYNC set. I then updated the condition above to instead check > for > B_SYNC being set rather than checking for B_ASYNC being clear. > Yes, sounds good to me. (Overloaded flags are almost never a great plan:-) > That converted all the FILESYNC write RPCs from my builds into > UNSTABLE > write RPCs. The patch for that is at > http://www.FreeBSD.org/~jhb/patches/nfsclient_sync_writes.patch. > > However, even with this change I was still not getting clustered > writes on > the NFS server (all writes were still 16k). After digging around in > the > code for a bit I found that ffs will only cluster writes if the passed > in > 'ioflags' to ffs_write() specify a sequential hint. I then noticed > that > the NFS server has code to keep track of sequential I/O heuristics for > reads, but not writes. I took the code from the NFS server's read op > and moved it into a function to compute a sequential I/O heuristic > that > could be shared by both reads and writes. I also updated the > sequential > heuristic code to advance the counter based on the number of 16k > blocks > in each write instead of just doing ++ to match what we do for local > file writes in sequential_heuristic() in vfs_vnops.c. Using this did > give me some measure of NFS write clustering (though I can't peg my > disks at MAXPHYS the way a dd to a file on a local filesystem can). > The > patch for these changes is at > http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch > > (This also fixes a bug in the new NFS server in that it wasn't > actually > clustering reads since it never updated nh->nh_nextr.) > > Combining the two changes together gave me about a 1% reduction in > wall > time for my builds: > > +------------------------------------------------------------------------------+ > |+ + ++ + +x++*x xx+x x x| > | |___________A__|_M_______|_A____________| | > +------------------------------------------------------------------------------+ > N Min Max Median Avg Stddev > x 10 1869.62 1943.11 1881.89 1886.12 21.549724 > + 10 1809.71 1886.53 1869.26 1860.706 21.530664 > Difference at 95.0% confidence > -25.414 +/- 20.2391 > -1.34742% +/- 1.07305% > (Student's t, pooled s = 21.5402) > > One caveat: I tested both of these patches on the old NFS client and > server > on 8.2-stable. I then ported the changes to the new client and server > and > while I made sure they compiled, I have not tested the new client and > server. 
> > -- > John Baldwin From owner-freebsd-fs@FreeBSD.ORG Fri Aug 26 14:19:12 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 33BCF106566B for ; Fri, 26 Aug 2011 14:19:12 +0000 (UTC) (envelope-from freebsd-fs@m.gmane.org) Received: from lo.gmane.org (lo.gmane.org [80.91.229.12]) by mx1.freebsd.org (Postfix) with ESMTP id E63E48FC12 for ; Fri, 26 Aug 2011 14:19:11 +0000 (UTC) Received: from list by lo.gmane.org with local (Exim 4.69) (envelope-from ) id 1QwxFS-0006R3-NI for freebsd-fs@freebsd.org; Fri, 26 Aug 2011 16:19:06 +0200 Received: from ib-jtotz.ib.ic.ac.uk ([155.198.110.220]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Fri, 26 Aug 2011 16:19:06 +0200 Received: from jtotz by ib-jtotz.ib.ic.ac.uk with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Fri, 26 Aug 2011 16:19:06 +0200 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-fs@freebsd.org From: Johannes Totz Date: Fri, 26 Aug 2011 15:18:51 +0100 Lines: 25 Message-ID: Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Complaints-To: usenet@dough.gmane.org X-Gmane-NNTP-Posting-Host: ib-jtotz.ib.ic.ac.uk User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20110812 Thunderbird/6.0 Subject: Monitoring zfs arc size (for tuning) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 26 Aug 2011 14:19:12 -0000 Heya, I've been monitoring my arc size, hit and miss rates etc to determine what a minimum arc size would be without performance going down the drain, for my workload. The only interesting point so far is that arc size still exceeds its limit occasionally! That is, kstat.zfs.misc.arcstats.size is larger than kstat.zfs.misc.arcstats.c_max (which is the same as vfs.zfs.arc_max) by a few hundred megabytes. Also, some of the periodic tasks that this machine does makes kstat.zfs.misc.arcstats.size go down to almost zero, and then shoot up to the max again in a short period of time. 
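[In case it is useful for graphing this over time: a minimal userland sketch, assuming only the sysctl names mentioned above, that samples both counters via sysctlbyname(3):

	#include <sys/types.h>
	#include <sys/sysctl.h>
	#include <stdint.h>
	#include <stdio.h>

	int
	main(void)
	{
		uint64_t size, c_max;
		size_t len;

		/* Current ARC size versus its configured maximum. */
		len = sizeof(size);
		if (sysctlbyname("kstat.zfs.misc.arcstats.size",
		    &size, &len, NULL, 0) == -1)
			return (1);
		len = sizeof(c_max);
		if (sysctlbyname("kstat.zfs.misc.arcstats.c_max",
		    &c_max, &len, NULL, 0) == -1)
			return (1);
		printf("arc size %ju / c_max %ju\n",
		    (uintmax_t)size, (uintmax_t)c_max);
		return (0);
	}

Sampling this in a loop every few seconds should catch both the overshoot past c_max and the drop/refill cycle described above.]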
This is on:

FreeBSD XXX 8.2-STABLE FreeBSD 8.2-STABLE #0 r224227: Wed Jul 20 16:55:23 BST 2011 root@XXX:/usr/obj/usr/src/sys/GENERIC amd64

Johannes

From owner-freebsd-fs@FreeBSD.ORG Fri Aug 26 14:42:22 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 570201065673 for ; Fri, 26 Aug 2011 14:42:22 +0000 (UTC) (envelope-from a.smith@ukgrid.net) Received: from mx1.ukgrid.net (mx1.ukgrid.net [89.107.22.36]) by mx1.freebsd.org (Postfix) with ESMTP id 1E4648FC0C for ; Fri, 26 Aug 2011 14:42:21 +0000 (UTC) Received: from [89.21.28.38] (port=29069 helo=omicron.ukgrid.net) by mx1.ukgrid.net with esmtp (Exim 4.76; FreeBSD) envelope-from a.smith@ukgrid.net envelope-to freebsd-fs@freebsd.org id 1QwxFm-000L8j-3Y; Fri, 26 Aug 2011 15:19:26 +0100 Received: from 80.174.147.150.dyn.user.ono.com (80.174.147.150.dyn.user.ono.com [80.174.147.150]) by webmail2.ukgrid.net (Horde Framework) with HTTP; Fri, 26 Aug 2011 15:19:26 +0100 Message-ID: <20110826151926.94566b2qa2s8mhrk@webmail2.ukgrid.net> Date: Fri, 26 Aug 2011 15:19:26 +0100 From: a.smith@ukgrid.net To: freebsd-fs@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; DelSp="Yes"; format="flowed" Content-Disposition: inline Content-Transfer-Encoding: 8bit User-Agent: Internet Messaging Program (IMP) H3 (4.3.9) / FreeBSD-8.1 Subject: LSI mps driver and STP support X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 26 Aug 2011 14:42:22 -0000

Hi,

apologies if this isn't the right list, as it's not really a file system question (it's storage).

Does anyone know if the new mps driver for the new generation LSI SAS HBAs supports SATA tunneled protocol for connecting SATA drives to a SAS expander disk chassis?  And if yes, is this stable?

thanks for your help, Andy.
From owner-freebsd-fs@FreeBSD.ORG Fri Aug 26 15:17:32 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 980B7106566B for ; Fri, 26 Aug 2011 15:17:32 +0000 (UTC) (envelope-from rincebrain@gmail.com) Received: from mail-qw0-f54.google.com (mail-qw0-f54.google.com [209.85.216.54]) by mx1.freebsd.org (Postfix) with ESMTP id 576C78FC14 for ; Fri, 26 Aug 2011 15:17:32 +0000 (UTC) Received: by qwc9 with SMTP id 9so2732638qwc.13 for ; Fri, 26 Aug 2011 08:17:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=FEv1gOGWfcerJ/jGdcnVXZR2kE+Zcy2gXmjsxevm7oI=; b=pzIp7tP17v5J90KMsLj6bhD3R9oqTTh2NaWfimb3fUVRgCoHxDDtlaKC4pS5bXMeXf WRdfDeUYEnAHSa4ybm0t1c67fToNvIe45BKnWQyltEp0pGIAftIUcGlnB17FHL+gTPjK bLZ5Bt74GmAvE6E+HG1w72baHzlwg4iLKWlXA= MIME-Version: 1.0 Received: by 10.229.65.84 with SMTP id h20mr1601582qci.52.1314370310557; Fri, 26 Aug 2011 07:51:50 -0700 (PDT) Sender: rincebrain@gmail.com Received: by 10.229.139.137 with HTTP; Fri, 26 Aug 2011 07:51:50 -0700 (PDT) In-Reply-To: <20110826151926.94566b2qa2s8mhrk@webmail2.ukgrid.net> References: <20110826151926.94566b2qa2s8mhrk@webmail2.ukgrid.net> Date: Fri, 26 Aug 2011 10:51:50 -0400 X-Google-Sender-Auth: 7jzZJ6N3jLApDXvoGA90KmToy4g Message-ID: From: Rich To: a.smith@ukgrid.net Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: freebsd-fs@freebsd.org Subject: Re: LSI mps driver and STP support X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 26 Aug 2011 15:17:32 -0000

Yes.

- Rich

On Fri, Aug 26, 2011 at 10:19 AM, wrote:
> Hi,
>
> apologies if this isn't the right list, as it's not really a file system
> question (it's storage).
>
> Does anyone know if the new mps driver for the new generation LSI SAS HBAs
> supports SATA tunneled protocol for connecting SATA drives to a SAS expander
> disk chassis?  And if yes, is this stable?
>
> thanks for your help, Andy.
>
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
>

From owner-freebsd-fs@FreeBSD.ORG Fri Aug 26 17:43:40 2011 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id AE55B1065674; Fri, 26 Aug 2011 17:43:40 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-annu.mail.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 3C5538FC13; Fri, 26 Aug 2011 17:43:39 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ap4EABjaV06DaFvO/2dsb2JhbABDhEykUIFAAQYjBFIbDgwCDRkCWQaxApFrgSyED4ERBJMakRw X-IronPort-AV: E=Sophos;i="4.68,286,1312171200"; d="scan'208";a="132366829" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-annu-pri.mail.uoguelph.ca with ESMTP; 26 Aug 2011 13:43:39 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 5B581B3F0F; Fri, 26 Aug 2011 13:43:39 -0400 (EDT) Date: Fri, 26 Aug 2011 13:43:39 -0400 (EDT) From: Rick Macklem To: John Baldwin Message-ID: <1882200362.409964.1314380619360.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <354219004.370807.1314320258153.JavaMail.root@erie.cs.uoguelph.ca> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692) Cc: Rick Macklem , fs@freebsd.org Subject: Re: Fixes to allow write clustering of NFS writes from a FreeBSD NFS client X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 26 Aug 2011 17:43:40 -0000

Correcting myself yet again:

> I eventually tracked this down to the code in the NFS client that pushes out a
> previous dirty region via 'bwrite()' when a write would dirty a non-contiguous
> region in the buffer:
>
> 	if (bp->b_dirtyend > 0 &&
> 	    (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) {
> 		if (bwrite(bp) == EINTR) {
> 			error = EINTR;
> 			break;
> 		}
> 		goto again;
> 	}
>

Btw, the code was correct to use FILESYNC for this case.  Why?  Well, if the b_dirtyoff, b_dirtyend are used by the "bottom half" for the write/commit RPCs, the client won't know to re-write/commit the range specified by b_dirtyoff/b_dirtyend after the range changes.  (ie. If the server crashes/reboots between the UNSTABLE write and the commit, the change will get lost.)  However, if you calculate the off, len arguments for the Commit RPC to cover the entire block and not just b_dirtyoff->b_dirtyend, then doing the write UNSTABLE should be fine.  (Having the range larger than what was written should be ok.  In fact the FreeBSD server ignores the arguments and commits the entire file via VOP_FSYNC().)

I realize I was wrong w.r.t. this.  If the server crashes and reboots between the write RPCs and the Commit RPC, the client will only know the last byte range to re-write.  For this to work correctly for UNSTABLE writes, a list of dirty byte ranges must be maintained and the client must do write RPCs for all of them (and do them again, if the server crashes before the commit).
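[To make that bookkeeping concrete, here is a minimal userland sketch with made-up names -- this is not the NFSv4 byte-range code mentioned below, just the idea: coalesce each new write into a per-buffer list of dirty byte ranges, all of which would be re-sent as write RPCs if the server reboots before the Commit RPC completes.  It assumes FreeBSD's sys/queue.h macros:

	#include <sys/types.h>
	#include <sys/queue.h>
	#include <errno.h>
	#include <stdlib.h>

	/* One dirty byte range within a buffer (hypothetical). */
	struct dirtyrange {
		off_t	dr_off;		/* start of dirty region */
		off_t	dr_end;		/* end of dirty region (exclusive) */
		TAILQ_ENTRY(dirtyrange) dr_link;
	};
	TAILQ_HEAD(dirtylist, dirtyrange);

	/*
	 * Merge the write [off, off + len) into the list, coalescing any
	 * ranges that overlap or touch it.  Every range still on the list
	 * must be re-sent if the server reboots before the Commit RPC
	 * completes.
	 */
	static int
	dirty_record(struct dirtylist *dl, off_t off, off_t len)
	{
		struct dirtyrange *dr, *tdr, *nr;
		off_t end = off + len;

		TAILQ_FOREACH_SAFE(dr, dl, dr_link, tdr) {
			if (dr->dr_end < off || dr->dr_off > end)
				continue;	/* disjoint, keep it */
			/* Absorb the overlapping range and delete it. */
			if (dr->dr_off < off)
				off = dr->dr_off;
			if (dr->dr_end > end)
				end = dr->dr_end;
			TAILQ_REMOVE(dl, dr, dr_link);
			free(dr);
		}
		if ((nr = malloc(sizeof(*nr))) == NULL)
			return (ENOMEM);
		nr->dr_off = off;
		nr->dr_end = end;
		TAILQ_INSERT_TAIL(dl, nr, dr_link);
		return (0);
	}

With such a list, a block can be written as one large UNSTABLE RPC once the ranges coalesce into the full block, as discussed earlier in the thread.]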
Btw, there is code in the NFSv4 stuff that handles a list of byte ranges.  It does so for the byte range locking, but you could just rename struct nfscllock to something without 'lock' in it and then reuse nfscl_updatelock() to handle the list.  (It might need a few tweaks for the non-lock case, but shouldn't need much.)

Hopefully I have finally got this correct and have not totally confused everyone, rick

From owner-freebsd-fs@FreeBSD.ORG Fri Aug 26 17:45:52 2011 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7D7B41065679; Fri, 26 Aug 2011 17:45:52 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail07.syd.optusnet.com.au (mail07.syd.optusnet.com.au [211.29.132.188]) by mx1.freebsd.org (Postfix) with ESMTP id C9DAD8FC12; Fri, 26 Aug 2011 17:45:51 +0000 (UTC) Received: from c122-106-165-191.carlnfd1.nsw.optusnet.com.au (c122-106-165-191.carlnfd1.nsw.optusnet.com.au [122.106.165.191]) by mail07.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id p7QHjlmm001436 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 27 Aug 2011 03:45:49 +1000 Date: Sat, 27 Aug 2011 03:45:47 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: John Baldwin In-Reply-To: <201108251709.30072.jhb@freebsd.org> Message-ID: <20110827012609.H859@besplex.bde.org> References: <201108251347.45460.jhb@freebsd.org> <20110826043611.D2962@besplex.bde.org> <201108251709.30072.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Rick Macklem , fs@freebsd.org Subject: Re: Fixes to allow write clustering of NFS writes from a FreeBSD NFS client X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 26 Aug 2011 17:45:52 -0000

On Thu, 25 Aug 2011, John Baldwin wrote:

> On Thursday, August 25, 2011 3:24:15 pm Bruce Evans wrote:
>> On Thu, 25 Aug 2011, John Baldwin wrote:
>>
>>> I was doing some analysis of compiles over NFS at work recently and noticed
>>> from 'iostat 1' on the NFS server that all my NFS writes were always 16k
>>> writes (meaning that writes were never being clustered).  I added some
>>
>> Did you see the old patches for this by Bjorn Gronwall?  They went through
>> many iterations.  He was mainly interested in the !async case and I was
>> mainly interested in the async case...
>
> Ah, no I had not seen these, thanks.

I looked at your patches after writing the above.  They look very similar, but the details are intricate.  Unfortunately I forget most of the details.  I reran some simple benchmarks (just iozone on a very old (~5.2) nfs client with various mount options, with netstat and systat to watch the resulting i/o on the server) on 3 different servers (~5.2 with Bjorn's patches, 8-current-2008 with Bjorn's patches, and -current-2011-March).  The old client has many throughput problems, but strangely most of them are fixed by changing the server.

>>> and moved it into a function to compute a sequential I/O heuristic that
>>> could be shared by both reads and writes.  I also updated the sequential
>>> heuristic code to advance the counter based on the number of 16k blocks
>>> in each write instead of just doing ++ to match what we do for local
>>> file writes in sequential_heuristic() in vfs_vnops.c.
Using this did >>> give me some measure of NFS write clustering (though I can't peg my >>> disks at MAXPHYS the way a dd to a file on a local filesystem can). The >> >> I got close to it. The failure modes were mostly burstiness of i/o, where >> the server buffer cache seemed to fill up so the client would stop sending >> and stay stopped for too long (several seconds; enough to reduce the >> throughput by 40-60%). > > Hmm, I can get writes up to around 40-50k, but not 128k. My test is to just > dd from /dev/zero to a file on the NFS client using a blocksize of 64k or so. I get mostly over 60K with old ata drivers that have a limit of 64K and mostly over 128K with not so old ata drivers that have a limit of 128K. This is almost independent of the nfs client and server versions and mount options. I mostly tested async mounts, and mostly with an i/o size of just 512 for iozone (old-iozone 1024 512). It actually helps a little to have a minimal i/o size at the syscall level (to minimize latency at other levels; depends on the CPU keeping up and the kernel reblocking to better sizes). Throughputs with client defaults (-U,-r8192(?),-w8192(?), async,noatime) in 1e9 bytes were approximately:

               write  read
  local disk:   48     53
  5.2 server:   46     39      some bug usually makes the read direction slow
  8 server:     46     39
  cur server:   32     50+(?)  writes 2/3 as fast due to not having patches,
                               but reads fixed (may also require tcp)

Async on the server makes little difference. Contrary to what I said before, async on the client makes a big difference (it controls FILESYNC in a critical place). Now with noasync on the client:

  5.2 server:   15
  8 server:     similar
  cur server:   similar (worse, but not nearly 3/2 slower IIRC)

There are just too many sync writes without async. But this is apparently mostly due to the default udp r/w sizes being too small, since tcp does much better, I think only due to its larger r/w sizes (I mostly don't use it because it has worse latency and more bugs in old nfs clients). Now with noasync,-T[-r32768(?),-w(32768)] on the client:

  5.2 server:   34     37
  8 server:     40+ (?)
  cur server:   not tested

The improvement is much larger for 8-server than for 5.2-server. That might be due to better tcp support, but I fear it is because 8-server is missing my fixes for ffs_update(). (The server file system was always ffs mounted async. Long ago, I got dyson to make fsync sort of work even when the file system is mounted async. VOP_FSYNC() writes data but not directory entries or inodes, except in my version it writes inodes. But actually writing the inode for every nfs FILESYNC probably doubles the number of i/o's. This is ameliorated as usual by a large i/o size at all levels, and by the disk lying about actually writing the data, so that doubling the number of writes doesn't give a full 2 times slowdown (I use old low end ATA disks with write caching enabled).) Now with async,-T[-r32768(?),-w(32768)] on the client:

  5.2 server:   37     40    example of tcp not working well with 5.2
  8 server:     not carefully tested (similar to -U)
  cur server:   not carefully tested (similar to -U)

In other tests, toggling tcp/udp and changing the block sizes makes differences that are hard to explain but not very important. It only magically fixes the case of an async client. My LAN uses a cheap switch but works almost perfectly for nfs over udp. I now remember that Bjorn was most interested in improving clustering for the noasync case. Clustering should happen almost automatically for the async case.
Then lots of async writes should accumulate on the server and be written by a large cluster write. Any clustering at the nfs level would just get in the way. For the noasync case, FILESYNC will get in the way whenever it happens, and it happens a lot, so I'm not sure the server has much opportunity for clustering. >>> patch for these changes is at >>> http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch >>> >>> (This also fixes a bug in the new NFS server in that it wasn't actually >>> clustering reads since it never updated nh->nh_nextr.) I'm still looking for the bug that makes reads slower. It doesn't seem to be clustering. >> Here is the version of Bjorn's patches that I last used (in 8-current in >> 2008):
>>
>> % Index: nfs_serv.c
>> % ===================================================================
>> % RCS file: /home/ncvs/src/sys/nfsserver/nfs_serv.c,v
>> % retrieving revision 1.182
>> % diff -u -2 -r1.182 nfs_serv.c
>> % --- nfs_serv.c 28 May 2008 16:23:17 -0000 1.182
>> % +++ nfs_serv.c 1 Jun 2008 05:52:45 -0000
>> ...
>> % +	/*
>> % +	 * Locate best nfsheur[] candidate using double hashing.
>> % +	 */
>> % +
>> % +	hi = NH_TAG(vp) % NUM_HEURISTIC;
>> % +	step = NH_TAG(vp) & HASH_MAXSTEP;
>> % +	step++;			/* Step must not be zero. */
>> % +	nh = &nfsheur[hi];
> > I can't speak to whether using a variable step makes an appreciable > difference. I have not examined that in detail in my tests. Generally, only small differences can be made by tuning hash methods.
>> % +	/*
>> % +	 * Calculate heuristic
>> % +	 */
>> % +
>> % +	lblocksize = vp->v_mount->mnt_stat.f_iosize;
>> % +	nblocks = howmany(uio->uio_resid, lblocksize);
> > This is similar to what I pulled out of sequential_heuristic() except > that it doesn't hardcode 16k. There is a big comment above the 16k > that says it isn't about the blocksize though, so I'm not sure which is > most correct. I imagine we'd want to use the same strategy in both places > though. Comment from vfs_vnops.c:
>
> 	/*
> 	 * f_seqcount is in units of fixed-size blocks so that it
> 	 * depends mainly on the amount of sequential I/O and not
> 	 * much on the number of sequential I/O's.  The fixed size
> 	 * of 16384 is hard-coded here since it is (not quite) just
> 	 * a magic size that works well here.  This size is more
> 	 * closely related to the best I/O size for real disks than
> 	 * to any block size used by software.
> 	 */
> 	fp->f_seqcount += howmany(uio->uio_resid, 16384);
Probably this doesn't matter. The above code in vfs_vnops.c is mostly by me. I think it is newer than the code in nfs_serv.c (strictly older, but nfs_serv.c has not caught up with it). I played a bit more with this in nfs_serv.c, to see if this should be different in nfs. In my local version, lblocksize can be set by a sysctl. But I only used this sysctl for testing, and don't remember it making any interesting differences.
>> % +	if (uio->uio_offset == nh->nh_nextoff) {
>> % +		nh->nh_seqcount += nblocks;
>> % +		if (nh->nh_seqcount > IO_SEQMAX)
>> % +			nh->nh_seqcount = IO_SEQMAX;
>> % +	} else if (uio->uio_offset == 0) {
>> % +		/* Seek to beginning of file, ignored. */
>> % +	} else if (qabs(uio->uio_offset - nh->nh_nextoff) <=
>> % +	    MAX_REORDERED_RPC*imax(lblocksize, uio->uio_resid)) {
>> % +		nfsrv_reordered_io++; /* Probably reordered RPC, do nothing. */
> > Ah, this is a nice touch! I had noticed reordered I/O's resetting my > clustered I/O count. I should try this extra step.
Stats after a few GB of i/o:

  % vfs.nfsrv.commit_blks: 138037
  % vfs.nfsrv.commit_miss: 2844
  % vfs.nfsrv.reordered_io: 5170
  % vfs.nfsrv.realign_test: 492003
  % vfs.nfsrv.realign_count: 0

There were only a few reorderings. In old testing, I seemed to get best results by turning the number of nfsd's down to 1. I don't use this in production. I turn the number of nfsiod's down to 4 in production.
>> % +	} else
>> % +		nh->nh_seqcount /= 2;	/* Not sequential access. */
> > Hmm, this is a bit different as well. sequential_heuristic() just > drops all clustering (seqcount = 1) here so I had followed that. I do > wonder if this change would be good for "normal" I/O as well? (Again, > I think it would do well to have "normal" I/O and NFS generally use > the same algorithm, but perhaps with the extra logic to handle reordered > writes more gracefully for NFS.) I don't know much about this.
>> % +
>> % +	nh->nh_nextoff = uio->uio_offset + uio->uio_resid;
> > Interesting. So this assumes the I/O never fails. Not too good. Some places like ffs_write() back out of failing i/o's, but I think they reduce uio_offset before the corresponding code for the non-nfs heuristic in vn_read/write() advances f_nextoff.
>> % @@ -1225,4 +1251,5 @@
>> %  	vn_finished_write(mntp);
>> %  	VFS_UNLOCK_GIANT(vfslocked);
>> % +	bwillwrite();	/* After VOP_WRITE to avoid reordering. */
>> %  	return(error);
>> % }
> > Hmm, this seems to be related to avoiding overloading the NFS server's > buffer cache? Just to avoid spurious reordering I think. Is this all still Giant locked? Giant might either reduce or increase interference between nfsd's, depending on the timing. >> ...
>> % Index: nfs_syscalls.c
>> % ===================================================================
>> % RCS file: /home/ncvs/src/sys/nfsserver/Attic/nfs_syscalls.c,v
>> % retrieving revision 1.119
>> % diff -u -2 -r1.119 nfs_syscalls.c
>> % --- nfs_syscalls.c 30 Jun 2008 20:43:06 -0000 1.119
>> % +++ nfs_syscalls.c 2 Jul 2008 07:12:57 -0000
>> % @@ -86,5 +86,4 @@
>> %  int nfsd_waiting = 0;
>> %  int nfsrv_numnfsd = 0;
>> % -static int notstarted = 1;
>> %
>> %  static int nfs_privport = 0;
>> % @@ -448,7 +447,6 @@
>> %  		procrastinate = nfsrvw_procrastinate;
>> %  	NFSD_UNLOCK();
>> % -	if (writes_todo || (!(nd->nd_flag & ND_NFSV3) &&
>> % -	    nd->nd_procnum == NFSPROC_WRITE &&
>> % -	    procrastinate > 0 && !notstarted))
>> % +	if (writes_todo || (nd->nd_procnum == NFSPROC_WRITE &&
>> % +	    procrastinate > 0))
>> %  		error = nfsrv_writegather(&nd, slp,
>> %  		    nfsd->nfsd_td, &mreq);
> > This no longer seems to be present in 8. nfs_syscalls.c seems to have been replaced by nfs_srvkrpc.c. All history has been lost (obscured), but the code is quite different so a repo-copy wouldn't have worked much better. This created lots of garbage if not bugs:
- the nfsrv.gatherdelay and nfsrv.gatherdelay_v3 sysctls are now in nfs_srvkrpc.c. They were already hard to associate with any effects, since their variable names don't match their sysctl names. The variables are named nfsrv_procrastinate and nfsrv_procrastinate_v3.
- the *procrastinate* global variables are still declared in nfs.h and initialized to defaults in nfs_serv.c, but are no longer really used.
- the local variable `procrastinate' and the above code to use it no longer exist.
- the macro for the default for the non-v3 sysctl, NFS_GATHERDELAY, is still defined in nfs.h, but is only used in the dead initialization.
- the new nfs server doesn't have any gatherdelay or procrastinate symbols.
Bjorn said that gatherdelay_v3 didn't work, and tried to fix it. The above is the final result that I have. I now remember trying this. Bjorn hoped that a nonzero gatherdelay would reduce reordering, but in practice it just reduces performance by waiting too long. Its default of 10 msec may have worked with 1 Mbps ethernet, but can't possibly scale to 1 Gbps. ISTR that the value had to be very small, perhaps 100 usec, for the delay not to be too large, but when it is that small it has problems having any effects except to waste CPU in a different way than delaying. > One thing I had done was to use a separate set of heuristics for reading vs > writing. However, that is possibly dubious (and we don't do it for local > I/O), so I can easily drop that feature if desired. I think it is unlikely to make much difference. The heuristic always has to cover a very wide range of access patterns. Bruce From owner-freebsd-fs@FreeBSD.ORG Fri Aug 26 18:12:25 2011 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 78BEA1065672 for ; Fri, 26 Aug 2011 18:12:25 +0000 (UTC) (envelope-from lev@FreeBSD.org) Received: from onlyone.friendlyhosting.spb.ru (onlyone.friendlyhosting.spb.ru [IPv6:2a01:4f8:131:60a2::2]) by mx1.freebsd.org (Postfix) with ESMTP id 185CB8FC13 for ; Fri, 26 Aug 2011 18:12:25 +0000 (UTC) Received: from lion.home.serebryakov.spb.ru (unknown [IPv6:2001:470:923f:1:b1b7:d4b2:b3b3:a68b]) (Authenticated sender: lev@serebryakov.spb.ru) by onlyone.friendlyhosting.spb.ru (Postfix) with ESMTPA id 6E4FA4AC59 for ; Fri, 26 Aug 2011 22:12:23 +0400 (MSD) Date: Fri, 26 Aug 2011 22:12:17 +0400 From: Lev Serebryakov Organization: FreeBSD X-Priority: 3 (Normal) Message-ID: <1164434239.20110826221217@serebryakov.spb.ru> To: fs@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=windows-1251 Content-Transfer-Encoding: quoted-printable Cc: Subject: Strange behaviour of UFS2+SU FS on FreeBSD 8-Stable: dreadful performance for old data, excellent for new. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: lev@FreeBSD.org List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 26 Aug 2011 18:12:25 -0000 Hello, Fs. It is ``common knowledge'' that UFS doesn't need defragmentation. But it seems not to be true in my (corner?) case. I have a FS which was growfs(8)ed from 2Tb (4x500Gb HDDs; all sizes are commercial, so really slightly less) to 8Tb (4x2Tb HDDs). It was almost full (~15% free space) before growing. Now it is in much better shape:

  =============
  /dev/raid5/storage    7.1T    2T    4.5T    30%    /usr/home
  =============

But ALL old data is read at about 20-30MiB/s (dd to /dev/null with bs=128k). I've checked the top-10 files (by size) and all of them are read at such speed, like:

  =============
  blob# dd if=/usr/home/storage/Video/some-film.mkv of=/dev/null bs=128k
  57972+1 records in
  57972+1 records out
  7598542537 bytes transferred in 305.013037 secs (24912189 bytes/sec)
  blob#
  =============

But it is not a software RAID5 or UFS-as-a-whole problem.
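(The ``simple program'' I mention below is nothing special; a minimal sketch of the idea, not my exact code:)

	/* Sketch only: write a 128KiB buffer with write(2) again and
	 * again, up to 32GiB, and report the resulting speed. */
	#include <err.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>
	#include <unistd.h>

	#define BUFSZ	(128 * 1024)
	#define TOTAL	(32LL * 1024 * 1024 * 1024)

	int
	main(int argc, char **argv)
	{
		static char buf[BUFSZ];
		long long done;
		time_t t0, dt;
		int fd;

		if (argc != 2)
			errx(1, "usage: generate file");
		arc4random_buf(buf, sizeof(buf));	/* random-ish data, filled once */
		if ((fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0)
			err(1, "%s", argv[1]);
		t0 = time(NULL);
		for (done = 0; done < TOTAL; done += BUFSZ)
			if (write(fd, buf, BUFSZ) != BUFSZ)
				err(1, "write");
		if (fsync(fd) != 0)
			err(1, "fsync");
		dt = time(NULL) - t0;
		printf("Size: %lld bytes, Speed: %lld bytes/s\n",
		    done, done / (dt > 0 ? dt : 1));
		close(fd);
		return (0);
	}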
I could create a file with random data at about 280MiB/s with a simple program of the sort sketched above, which writes a 128KiB buffer with write(2) again and again (up to 32GiB, for example), and after that this new file could be read at 175MiB/s (which is less than 50% of the theoretical maximum, but is not bad, IMHO). Yes, this box has only 2GiB of RAM, so it IS NOT reading from cache:

  =============
  blob# ./generate big.file.dat
  Size: 34359738368 bytes, Speed: 283964779 bytes/s
  blob# dd if=big.file.dat of=/dev/null bs=128k
  34359738368 bytes transferred in 196.044398 secs (175265086 bytes/sec)
  blob#
  =============

How could I improve the situation with "old" data? "Backup, recreate FS and restore" is not an option, as I don't have 2TB+ of redundant space at hand, and backup to one 2Tb external disk is not what I want at all. Would copying files one-by-one inside this FS help? Like, create "/usr/home/copy", copy everything to this directory, remove the originals, and move all files back out of "/copy"? Or is that a bad idea too? -- // Black Lion AKA Lev Serebryakov From owner-freebsd-fs@FreeBSD.ORG Fri Aug 26 18:49:52 2011 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6B6AC106564A for ; Fri, 26 Aug 2011 18:49:52 +0000 (UTC) (envelope-from mckusick@mckusick.com) Received: from chez.mckusick.com (chez.mckusick.com [70.36.157.235]) by mx1.freebsd.org (Postfix) with ESMTP id 31E0A8FC12 for ; Fri, 26 Aug 2011 18:49:52 +0000 (UTC) Received: from chez.mckusick.com (localhost [127.0.0.1]) by chez.mckusick.com (8.14.3/8.14.3) with ESMTP id p7QIXhlZ008671; Fri, 26 Aug 2011 11:33:43 -0700 (PDT) (envelope-from mckusick@chez.mckusick.com) Message-Id: <201108261833.p7QIXhlZ008671@chez.mckusick.com> To: lev@freebsd.org In-reply-to: <1164434239.20110826221217@serebryakov.spb.ru> Date: Fri, 26 Aug 2011 11:33:43 -0700 From: Kirk McKusick X-Spam-Status: No, score=0.0 required=5.0 tests=MISSING_MID, UNPARSEABLE_RELAY autolearn=failed version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on chez.mckusick.com Cc: fs@freebsd.org Subject: Re: Strange behaviour of UFS2+SU FS on FreeBSD 8-Stable: dreadful performance for old data, excellent for new. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 26 Aug 2011 18:49:52 -0000 If your old disk was full for a long time and had a lot of activity, the most recently created files are likely to have poor layout. Copying them should give them an improved layout now that much more space is available. Best is to copy one, then remove the original copy. Copy the next one and remove its old copy, etc. Please let me know if this works, as if it does not, something is wrong with growfs.
Kirk McKusick From owner-freebsd-fs@FreeBSD.ORG Fri Aug 26 19:21:09 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id EDAF8106566C; Fri, 26 Aug 2011 19:21:08 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-annu.mail.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 78AEB8FC13; Fri, 26 Aug 2011 19:21:08 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqAEAIvxV06DaFvO/2dsb2JhbAA7CBaENqRRgUABAQEBAwEBARoGBCcgBgUbGAICDRkCKQEJJgYIBwQBHASHVahukWiBLIF/ghCBEQSRCoIQiD2IXw X-IronPort-AV: E=Sophos;i="4.68,286,1312171200"; d="scan'208";a="132379155" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-annu-pri.mail.uoguelph.ca with ESMTP; 26 Aug 2011 15:21:07 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 8D316B3F1F; Fri, 26 Aug 2011 15:21:07 -0400 (EDT) Date: Fri, 26 Aug 2011 15:21:07 -0400 (EDT) From: Rick Macklem To: John Message-ID: <470466580.417415.1314386467564.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <20110825203151.GA61776@FreeBSD.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.203] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692) Cc: freebsd-fs@freebsd.org, freebsd-current@freebsd.org Subject: Re: F_RDLCK lock to FreeBSD NFS server fails to R/O target file [PATCH] X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 26 Aug 2011 19:21:09 -0000 John wrote: > After pondering the best way to allow the VOP_ACCESS() call to > only query for the permissions really needed, I've come up with > a patch that minimally adds one parameter to the nlm_get_vfs_state() > function call with the lock type from the original argp. > > http://people.freebsd.org/~jwd/nlm_prot_impl.c.accmode.patch > I took a look at it and it seemed mostly ok. However, please note that I am not familiar with the NLM code and try to avoid it like the plague. :-) One place I would suggest changing is the nlm_do_unlock() case. I don't think any file permission checking is needed for an unlock, and it seems like it might fail when the caller has VWRITE but not VREAD permissions on the file? Leaving a file locked is not a good situation. I would just not even do the VOP_ACCESS() call for that case. Maybe pass accmode == 0 into nlm_get_vfs_state() to indicate "skip the VOP_ACCESS() call". I think that this patch might be a little risky to put into head at this stage of the release cycle so, personally, I'd wait until after the 9.0 release before I'd look at committing it. Others might feel it's ok to go in now? rick > I'd appreciate a review and seeing what might be required to commit > this prior to 9 release. > > Thanks, > John > > ----- John's Original Message ----- > > Hi Fellow NFS'ers, > > > > I believe I have found the problem we've been having with read > > locks > > while attaching to a FreeBSD NFS server. > > > > In sys/nlm/nlm_prot_impl.c, function nlm_get_vfs_state(), there > > is a call > > to VOP_ACCESS() as follows:
> >
> >	/*
> >	 * Check cred.
> >	 */
> >	NLM_DEBUG(3, "nlm_get_vfs_state(): Calling VOP_ACCESS(VWRITE) "
> >	    "with cred->cr_uid=%d\n", cred->cr_uid);
> >	error = VOP_ACCESS(vs->vs_vp, VWRITE, cred, curthread);
> >	if (error) {
> >		NLM_DEBUG(3, "nlm_get_vfs_state(): caller_name = %s "
> >		    "VOP_ACCESS() returns %d\n",
> >		    host->nh_caller_name, error);
> >		goto out;
> >	}
> >
> > The file being accessed is read only to the user, and open()ed with O_RDONLY. The lock being requested is for a read.
> >
> >	fd = open(filename, O_RDONLY, 0);
> >	...
> >	lblk.l_type  = F_RDLCK;
> >	lblk.l_start = 0;
> >	lblk.l_whence= SEEK_SET;
> >	lblk.l_len   = 0;
> >	lblk.l_pid   = 0;
> >	rc = fcntl(fd, F_SETLK, &lblk);
> >
> > Running the above from a remote system, the lock call fails with errno set to ENOLCK. The cred->cr_uid comes in as 227, which is my uid on the remote system. Since the file is R/O to me, and the VOP_ACCESS() is asking for VWRITE, it fails with errno 13, EACCES, Permission denied. > > > > The above operations work correctly against some of our other favorite big-name nfs vendors :-) > > > > Opinions on the "correct" way to fix this? > > > > 1. Since we're only asking for a read lock, why do we need to ask > > for VWRITE? I may not understand an underlying requirement for > > the VWRITE so please feel free to educate me if needed. > > > > Something like: request == F_RDLCK ? VREAD : VWRITE > > (need to figure out where to get the request from in this > > context). > > > > 2. Attempt VWRITE, fall back to VREAD... seems off to me though. > > > > 3. Other? > > > > I appreciate any thoughts on this. > > > > Thanks, > > John > > > > While they might not follow style(9) completely, I've uploaded > > my patch to nlm_prot_impl.c with the NLM_DEBUG() calls I've added. > > I'd appreciate it if someone would consider committing them so > > whoever debugs this file next will have them available. > > > > http://people.freebsd.org/~jwd/nlm_prot_impl.c.patch > > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" From owner-freebsd-fs@FreeBSD.ORG Fri Aug 26 19:28:06 2011 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A28EF106566C; Fri, 26 Aug 2011 19:28:06 +0000 (UTC) (envelope-from lev@FreeBSD.org) Received: from onlyone.friendlyhosting.spb.ru (onlyone.friendlyhosting.spb.ru [IPv6:2a01:4f8:131:60a2::2]) by mx1.freebsd.org (Postfix) with ESMTP id 3FAD58FC0A; Fri, 26 Aug 2011 19:28:06 +0000 (UTC) Received: from lion.home.serebryakov.spb.ru (unknown [IPv6:2001:470:923f:1:b1b7:d4b2:b3b3:a68b]) (Authenticated sender: lev@serebryakov.spb.ru) by onlyone.friendlyhosting.spb.ru (Postfix) with ESMTPA id D78254AC31; Fri, 26 Aug 2011 23:28:03 +0400 (MSD) Date: Fri, 26 Aug 2011 23:27:58 +0400 From: Lev Serebryakov Organization: FreeBSD Project X-Priority: 3 (Normal) Message-ID: <1963980291.20110826232758@serebryakov.spb.ru> To: Kirk McKusick In-Reply-To: <201108261833.p7QIXhlZ008671@chez.mckusick.com> References: <1164434239.20110826221217@serebryakov.spb.ru> <201108261833.p7QIXhlZ008671@chez.mckusick.com> MIME-Version: 1.0 Content-Type: text/plain; charset=windows-1251 Content-Transfer-Encoding: quoted-printable Cc: fs@freebsd.org Subject: Re: Strange behaviour of UFS2+SU FS on FreeBSD 8-Stable: dreadful performance for old data, excellent for new.
X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: lev@FreeBSD.org List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 26 Aug 2011 19:28:06 -0000 Hello, Kirk. You wrote 26 August 2011, 22:33:43: > If your old disk was full for a long time and had a lot of activity, > the most recently created files are likely to have poor layout. > Copying them should give them an improved layout now that much > more space is available. Best is to copy one, then remove the > original copy. Copy the next one and remove its old copy, etc. > Please let me know if this works, as if it does not, something > is wrong with growfs. Ok, I'll write a script for this monotonous work :) I could even copy them one-by-one via another file system (copy out, delete, copy in). Will it be better or not? -- // Black Lion AKA Lev Serebryakov From owner-freebsd-fs@FreeBSD.ORG Fri Aug 26 20:52:48 2011 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A4B37106566C; Fri, 26 Aug 2011 20:52:48 +0000 (UTC) (envelope-from mckusick@mckusick.com) Received: from chez.mckusick.com (chez.mckusick.com [70.36.157.235]) by mx1.freebsd.org (Postfix) with ESMTP id 83F3D8FC14; Fri, 26 Aug 2011 20:52:48 +0000 (UTC) Received: from chez.mckusick.com (localhost [127.0.0.1]) by chez.mckusick.com (8.14.3/8.14.3) with ESMTP id p7QKqpen039191; Fri, 26 Aug 2011 13:52:51 -0700 (PDT) (envelope-from mckusick@chez.mckusick.com) Message-Id: <201108262052.p7QKqpen039191@chez.mckusick.com> To: lev@freebsd.org In-reply-to: <1963980291.20110826232758@serebryakov.spb.ru> Date: Fri, 26 Aug 2011 13:52:51 -0700 From: Kirk McKusick X-Spam-Status: No, score=0.0 required=5.0 tests=MISSING_MID, UNPARSEABLE_RELAY autolearn=failed version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on chez.mckusick.com Cc: fs@freebsd.org Subject: Re: Strange behaviour of UFS2+SU FS on FreeBSD 8-Stable: dreadful performance for old data, excellent for new. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 26 Aug 2011 20:52:48 -0000 > Date: Fri, 26 Aug 2011 23:27:58 +0400 > From: Lev Serebryakov > Organization: FreeBSD Project > To: Kirk McKusick > Cc: fs@freebsd.org > Subject: Re: Strange behaviour of UFS2+SU FS on FreeBSD 8-Stable: dreadful > performance for old data, excellent for new. > > Hello, Kirk. > > You wrote 26 August 2011, 22:33:43: > > > If your old disk was full for a long time and had a lot of activity, > > the most recently created files are likely to have poor layout. > > Copying them should give them an improved layout now that much > > more space is available. Best is to copy one, then remove the > > original copy. Copy the next one and remove its old copy, etc. > > Please let me know if this works, as if it does not, something > > is wrong with growfs. > > Ok, I'll write a script for this monotonous work :) I could even copy > them one-by-one via another file system (copy out, delete, copy in). > Will it be better or not? > > -- > // Black Lion AKA Lev Serebryakov Given how much bigger your new filesystem has grown, copying out to another filesystem should not be necessary.
However, it will likely be quicker to copy out to another filesystem, as you will be using two spindles instead of seeking back and forth on one. But before doing all that work, try copying one file to ensure that you get the expected speedup. Kirk McKusick From owner-freebsd-fs@FreeBSD.ORG Sat Aug 27 02:58:26 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 133D8106564A for ; Sat, 27 Aug 2011 02:58:26 +0000 (UTC) (envelope-from freebsd@deman.com) Received: from plato.corp.nas.com (plato.corp.nas.com [66.114.32.138]) by mx1.freebsd.org (Postfix) with ESMTP id ECE7D8FC0C for ; Sat, 27 Aug 2011 02:58:25 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by plato.corp.nas.com (Postfix) with ESMTP id 3931BF2DFE70 for ; Fri, 26 Aug 2011 19:58:25 -0700 (PDT) X-Virus-Scanned: amavisd-new at corp.nas.com Received: from plato.corp.nas.com ([127.0.0.1]) by localhost (plato.corp.nas.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id AxJn2xLcgIoW for ; Fri, 26 Aug 2011 19:58:23 -0700 (PDT) Received: from [192.168.0.4] (97-115-69-183.ptld.qwest.net [97.115.69.183]) by plato.corp.nas.com (Postfix) with ESMTPSA id D77E0F2DFE61 for ; Fri, 26 Aug 2011 19:58:23 -0700 (PDT) From: Michael DeMan Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Date: Fri, 26 Aug 2011 19:58:22 -0700 Message-Id: To: freebsd-fs@freebsd.org Mime-Version: 1.0 (Apple Message framework v1084) X-Mailer: Apple Mail (2.1084) Subject: hast+zfs and pool expansion via disk replacements X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 27 Aug 2011 02:58:26 -0000 Hi All, We are thinking about trying out a 2-way or 3-way mirror for a filer based on FreeBSD, with HAST and ZFS. With traditional ZFS, expanding capacity in what I think of as a traditional 'stripe over mirrors' is quite easy: replace, say, 500GB disks one at a time with 2TB disks in a given mirrored unit. How does this work if we are using HAST? Is it transparent? Is it impossible? Is it possible to do and still be 24x7 with a little bit of command line work?
Thanks, - Mike DeMan From owner-freebsd-fs@FreeBSD.ORG Sat Aug 27 07:21:24 2011 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B4FCC106564A for ; Sat, 27 Aug 2011 07:21:24 +0000 (UTC) (envelope-from lev@FreeBSD.org) Received: from onlyone.friendlyhosting.spb.ru (onlyone.friendlyhosting.spb.ru [IPv6:2a01:4f8:131:60a2::2]) by mx1.freebsd.org (Postfix) with ESMTP id 52A6D8FC12 for ; Sat, 27 Aug 2011 07:21:24 +0000 (UTC) Received: from lion.home.serebryakov.spb.ru (unknown [IPv6:2001:470:923f:1:b1b7:d4b2:b3b3:a68b]) (Authenticated sender: lev@serebryakov.spb.ru) by onlyone.friendlyhosting.spb.ru (Postfix) with ESMTPA id 5DBFF4AC31; Sat, 27 Aug 2011 11:21:22 +0400 (MSD) Date: Sat, 27 Aug 2011 11:21:16 +0400 From: Lev Serebryakov Organization: FreeBSD Project X-Priority: 3 (Normal) Message-ID: <758608837.20110827112116@serebryakov.spb.ru> To: Kirk McKusick In-Reply-To: <201108262052.p7QKqpen039191@chez.mckusick.com> References: <1963980291.20110826232758@serebryakov.spb.ru> <201108262052.p7QKqpen039191@chez.mckusick.com> MIME-Version: 1.0 Content-Type: text/plain; charset=windows-1251 Content-Transfer-Encoding: quoted-printable Cc: fs@freebsd.org Subject: Re: Strange behaviour of UFS2+SU FS on FreeBSD 8-Stable: dreadful performance for old data, excellent for new. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: lev@FreeBSD.org List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 27 Aug 2011 07:21:24 -0000 Hello, Kirk. You wrote 27 August 2011, 0:52:51: > Given how much bigger your new filesystem has grown, copying out to > another filesystem should not be necessary. However, it will likely These are theoretically different strategies: (a) copy from the FS to the same FS and remove the old copy later, forcing the OS to choose a new place for the data; (b) copy out, remove, copy in, allowing the OS to use the same place. > be quicker to copy out to another filesystem as you will be using > two spindles instead of seeking back and forth on one. But before > doing all that work, try copying one file to ensure that you get the > expected speedup. Yep, the three biggest files were sped up significantly.
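For the record, each iteration of the script will boil down to copy-then-rename, so the file's data gets a fresh allocation. A sketch in C of the per-file step (illustrative only; the real thing will be a few lines of shell, and this version doesn't preserve owner/mode/times):

	#include <err.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	static void
	rewrite_in_place(const char *path)
	{
		static char buf[128 * 1024];
		char tmp[1024];
		ssize_t n;
		int in, out;

		snprintf(tmp, sizeof(tmp), "%s.defrag", path);
		if ((in = open(path, O_RDONLY)) < 0)
			err(1, "%s", path);
		if ((out = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600)) < 0)
			err(1, "%s", tmp);
		while ((n = read(in, buf, sizeof(buf))) > 0)
			if (write(out, buf, (size_t)n) != n)
				err(1, "write %s", tmp);
		if (n < 0)
			err(1, "read %s", path);
		close(in);
		if (fsync(out) != 0 || close(out) != 0)
			err(1, "%s", tmp);
		if (rename(tmp, path) != 0)	/* atomic replacement */
			err(1, "rename");
	}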
I'm going to investigate later why it is only ~180MiB/s, when theoretically it should be about (90*4) 360MiB/s linear read, and whom to blame: UFS or geom_raid5 or both :) -- // Black Lion AKA Lev Serebryakov From owner-freebsd-fs@FreeBSD.ORG Sat Aug 27 11:21:29 2011 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1C802106564A; Sat, 27 Aug 2011 11:21:29 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail02.syd.optusnet.com.au (mail02.syd.optusnet.com.au [211.29.132.183]) by mx1.freebsd.org (Postfix) with ESMTP id 946528FC0C; Sat, 27 Aug 2011 11:21:28 +0000 (UTC) Received: from c122-106-165-191.carlnfd1.nsw.optusnet.com.au (c122-106-165-191.carlnfd1.nsw.optusnet.com.au [122.106.165.191]) by mail02.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id p7RBLOPv024619 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 27 Aug 2011 21:21:25 +1000 Date: Sat, 27 Aug 2011 21:21:24 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Rick Macklem In-Reply-To: <354219004.370807.1314320258153.JavaMail.root@erie.cs.uoguelph.ca> Message-ID: <20110827194709.E1286@besplex.bde.org> References: <354219004.370807.1314320258153.JavaMail.root@erie.cs.uoguelph.ca> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Rick Macklem , fs@freebsd.org Subject: Re: Fixes to allow write clustering of NFS writes from a FreeBSD NFS client X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 27 Aug 2011 11:21:29 -0000 On Thu, 25 Aug 2011, Rick Macklem wrote: > John Baldwin wrote: >> ... >> That converted all the FILESYNC write RPCs from my builds into >> UNSTABLE >> write RPCs. The patch for that is at >> http://www.FreeBSD.org/~jhb/patches/nfsclient_sync_writes.patch. >> >> However, even with this change I was still not getting clustered >> writes on >> the NFS server (all writes were still 16k). After digging around in >> the >> code for a bit I found that ffs will only cluster writes if the passed >> in >> 'ioflags' to ffs_write() specify a sequential hint. I then noticed >> that >> the NFS server has code to keep track of sequential I/O heuristics for >> reads, but not writes. I took the code from the NFS server's read op >> and moved it into a function to compute a sequential I/O heuristic >> that >> could be shared by both reads and writes. I also updated the >> sequential >> heuristic code to advance the counter based on the number of 16k >> blocks >> in each write instead of just doing ++ to match what we do for local >> file writes in sequential_heuristic() in vfs_vnops.c. Using this did >> give me some measure of NFS write clustering (though I can't peg my >> disks at MAXPHYS the way a dd to a file on a local filesystem can). >> The >> patch for these changes is at >> http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch >> > The above says you understand this stuff and I don't. However, I will note I only know much about this part (I once actually understood it). > that the asynchronous case, which starts the write RPC now, makes clustering > difficult and limits what you can do. (I think it was done in the bad old async as opposed to delayed is bad, but is mostly avoided anyway, at least at the ffs and vfs levels on the server.
This was a major optimization by dyson about 15 years ago. I don't understand the sync/async/delayed writes on the client at the nfs level. At least the old nfsclient doesn't even call bawrite(), but it might do the equivalent using a flag. On the server, nfs doesn't use any of bwrite/bawrite/bdwrite(). It just uses VOP_WRITE() which does whatever the server file system does. Most file systems in FreeBSD use cluster_write() in most cases. This is from 4.4BSD-Lite. It replaces an unconditional bawrite() in Net/2 in the most usual case where the write is of exactly 1 fs-block (usually starting with a larger write that is split up into fs-blocks and a possible sub-block at the beginning and end only). cluster_write() also has major optimizations by dyson. In the usual case it turns into bdwrite(), to give a chance for a full cluster to accumulate, and in most cases there would be little difference in the effects if the callers were simplified to call bdwrite() directly. (The difference is just that with cluster_write(), a write will occur as soon as a cluster forms, while with bdwrite() a write will not occur until the next sync unless the buffer cache is very dirty. bawrite() used to be used instead of bdwrite() mainly to reduce pressure on the buffer cache. It was thought that the end of a block was a good time to start writing. That was when 16 buffers containing 4K each was a lot of data :-).) The next and last major optimization in this area was to improve VOP_FSYNC() to handle a large number of delayed writes better. It was changed to use vfs_bio_awrite() where in 4.4BSD it used bawrite(). vfs_bio_awrite() is closer to the implementation and has a better understanding of clustering than bawrite(). I forget why bawrite() wasn't just replaced by the internals of vfs_bio_awrite(). sync writes from nfs and O_SYNC from userland tend to defeat all of the bawrite()/bdwrite() optimizations, by forcing a bwrite(). nfs defaults to sync writes, so all it can do to use the optimizations is to do very large sync writes which are split up into smaller delayed ones in a way that doesn't interfere with clustering. I don't understand the details of what it does. > days to avoid flooding the buffer cache and then having things pushing writes > back to get buffers. These days the buffer cache can be much bigger and it's > easy to create kernel threads to do write backs at appropriate times. As such, > I'd lean away from asynchronous (as in start the write now) and towards delayed > writes. On FreeBSD servers, this is already mostly handled by using cluster_write(). Buffer cache pressure is still difficult to handle though. I saw it having bad effects mainly in my silly benchmark for this nfs server clustering optimization, of writing 1GB. The buffer cache would fill up with dirty buffers which take too long to write (1000-2000 dirty ones out of 8000. 2000 of size 16K each is 32MB. These take 0.5-1 seconds to write). While they were being written, the nfsclient has to stop sending (it shouldn't stop until the buffer cache is completely full but it does). Any stoppage gives under-utilization of the network, and my network has just enough bandwidth to keep up with the disk. Stopping for a short time wouldn't be bad, but for some reason it didn't restart soon enough to keep the writes streaming. I didn't see this when I repeated the benchmark yesterday. I must have done some tuning to reduce the problem, but forget what it was.
I would start looking for it near the buf_dirty_count_severe() test in ffs_write(). This defeats clustering and may be too aggressive or mistuned. What I don't like about this is that when severe buffer cache pressure develops, using bawrite() instead of cluster_write() tends to increase the pressure, by writing new dirty buffers at half the speed. I never saw any problems from the buffer cache pressure with local disks (except for writing to DVDs, where writes often stall near getblk() for several seconds). > If the writes are delayed "bdwrite()" then I think it is much easier > to find contiguous dirty buffers to do as one write RPC. However, if > you just do bdwrite()s, there tends to be big bursts of write RPCs when > the syncer does its thing, unless kernel threads are working through the > cache doing write backs. It might not matter a lot (except on large-latency links) what the client does. MTUs of only 1500 are still too common, so there is a lot of reassembly of blocks at the network level. A bit more at the RPC and (both client and server) block level won't matter provided you don't synchronize after every piece. Hmm, those bursts on the client aren't so good, and may explain why the client stalled in my tests. At least the old nfs client never uses either cluster_write() or vfs_bio_awrite() (or bawrite()). I don't understand why, but if it uses bdwrite() when it should use cluster_write() then it won't have the advantage of cluster_write() over bdwrite() -- of writing as soon as a cluster forms. It does use B_CLUSTEROK. I think this mainly causes clustering to work when all the delayed-write buffers are written eventually. Now I don't see much point in using either delayed writes or clustering on the client. Clustering is needed for non-solid-state disks mainly because their seek time is so large. Larger blocks are only good for their secondary effects of reducing overheads and latency. > Since there are nfsiod threads, maybe these could scan for contiguous > dirty buffers and start big write RPCs for them? If there was some time > limit set for how long the buffer sits dirty before it gets a write started > for it, that would avoid a burst caused by the syncer. One of my tunings was to reduce the number of nfsiod's. > Also, if you are lucky w.r.t. doing delayed writes for temporary files, the > file gets deleted before the write-back. In ffs, this is another optimization by dyson. Isn't it defeated by sync writes from ffs? Is it possible for a file written on the client to never reach the server? Even if the data doesn't, I think the directory and inode creation should. Even for ffs mounted async, I think there are writes of some metadata for deleted files, because although the data blocks are dead, some metadata blocks like ones for inodes are shared with other files and must have been dirtied by create followed by delete, so they remain undead but are considered dirty although their accumulated changes should be null. The writes are just often coalesced by the delay, so instead of 1000 writes to the same place for an inode that is created and deleted 500 times, you get just 1 write for null changes at the end. My version of ffs_update() has some optimizations to avoid writing null changes, but I think this doesn't help here since it still sees the changes in-core as they occur.
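To recap, the decision all of this revolves around is at the end of ffs_write(), roughly like the following (paraphrased from memory, so check the real source before trusting the details):

	/* Per completed buffer in ffs_write() (paraphrase, not exact). */
	if (ioflag & IO_SYNC)
		(void)bwrite(bp);	/* sync write: do it now, and wait */
	else if (vm_page_count_severe() || buf_dirty_count_severe()) {
		bp->b_flags |= B_CLUSTEROK;
		bawrite(bp);		/* pressure: start the write now */
	} else if (xfersize + blkoffset == fs->fs_bsize &&
	    (vp->v_mount->mnt_flag & MNT_NOCLUSTERW) == 0) {
		bp->b_flags |= B_CLUSTEROK;
		cluster_write(vp, bp, ip->i_size, seqcount);
	} else {
		bp->b_flags |= B_CLUSTEROK;
		bdwrite(bp);		/* delay; hope a cluster forms */
	}

The point for nfs is the middle branch: once the dirty buffer count goes severe, every new full block goes out via bawrite() with no chance to cluster, which is exactly when clustering would help most.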
Bruce From owner-freebsd-fs@FreeBSD.ORG Sat Aug 27 12:00:52 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6B848106566B; Sat, 27 Aug 2011 12:00:52 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from mail-wy0-f182.google.com (mail-wy0-f182.google.com [74.125.82.182]) by mx1.freebsd.org (Postfix) with ESMTP id 5073D8FC19; Sat, 27 Aug 2011 12:00:51 +0000 (UTC) Received: by wyh15 with SMTP id 15so3774791wyh.13 for ; Sat, 27 Aug 2011 05:00:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:sender:date:x-google-sender-auth:message-id:subject :from:to:cc:content-type; bh=RDMNnosI9KniBdB7WF2tbR4EhuP2E1SB7goIvMgHsRw=; b=PNLzFfRMAjWLKNzPDwkyP3ROoXp26csdPhZfcnefsmUYUfNMmCTBN5sblVFwcKVnf2 mLCwVx+G4VNdj/PA5qEXQtnocmGD5PrY+XZ+qsIxbqD9mPWEtzEL8NldKXCaCtpv/VUR 8AU5dbg1gTv08H0efkzW/MekPq6B23YdoedzM= MIME-Version: 1.0 Received: by 10.227.72.200 with SMTP id n8mr1083196wbj.19.1314446450094; Sat, 27 Aug 2011 05:00:50 -0700 (PDT) Sender: asmrookie@gmail.com Received: by 10.227.206.139 with HTTP; Sat, 27 Aug 2011 05:00:50 -0700 (PDT) Date: Sat, 27 Aug 2011 14:00:50 +0200 X-Google-Sender-Auth: vbAIcujJKXBpup-9FAoZdXFwgU4 Message-ID: From: Attilio Rao To: freebsd-arch@freebsd.org, freebsd-current@freebsd.org, FreeBSD FS Content-Type: text/plain; charset=UTF-8 Cc: Robert Watson , Konstantin Belousov Subject: Removal of Giant from the VFS layer for 10.0 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 27 Aug 2011 12:00:52 -0000 [ Sorry for cross-posting, but I included -arch@ for technical discussion, -current@ for reaching the wider audience and -fs@ for the relevance of the matter.] Over the last years a lot of effort by several developers has gone into reducing Giant's influence over the entire kernel. The VFS layer was no exception, as several major tasks have been completed over the years, including fine-grained locking for the vnode lifecycle, fine-grained locking of the VFS structure (mount), fine-grained locking of specific filesystems (UFS, NFS, etc.) and several locking improvements to surrounding subsystems (buffer cache, filedesc objects, VM layer, etc.). While FreeBSD has done pretty well so far, a major push is still needed in order to completely remove Giant from our VFS and buffer cache subsystems. At the present time, the biggest problem is that there are still filesystems which are not properly fine-grained locked, relying on Giant for assuring atomicity. It is time to make a decision for them, in order to aim for a Giant-less VFS in our next release. With the aid of kib and rwatson I made a roughly outlined plan about what is left to do in order to have all the filesystems locked (or eventually dropped) before 10.0; it is summarized here: http://wiki.freebsd.org/NONMPSAFE_DEORBIT_VFS As you can note from the page, the plan is thought to be 18 months long, including time for developers to convert all our filesystems and let third-party producers do the same with their proprietary filesystems. Also the introduction (and later on removal) of VFS_GIANT_COMPATIBILITY is intended to expose the presence of not-yet-MPSAFE filesystems used by consumers and force proactive action.
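To make the mechanics concrete: the compatibility shim in question is the conditional Giant acquisition around filesystem entry points, which call sites use roughly like this (a sketch of the existing idiom, not new code):

	/*
	 * Existing VFS compatibility idiom: Giant is acquired only if
	 * the filesystem did not set MNTK_MPSAFE at mount time.
	 */
	int vfslocked, error;

	vfslocked = VFS_LOCK_GIANT(mp);	/* no-op for MPSAFE filesystems */
	error = VFS_STATFS(mp, sbp);	/* some VFS or VOP call */
	VFS_UNLOCK_GIANT(vfslocked);

Once every filesystem in the tree is MPSAFE (or gone), VFS_LOCK_GIANT()/VFS_UNLOCK_GIANT() degenerate to nothing and can be removed outright, which is the end state this plan aims for.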
As you can note from the page, the list of filesystems to be converted is small and well contained, but there are some edge cases that really concern me, like ntfs and smbfs. I took over leadership of ntfs, but if someone is willing to take it over instead, please just drop an e-mail and I'll happily hand it over. About smbfs, I really think this is the key filesystem to fix in the list, and it is time for someone to step up and do the job (including also locking and reworking netsmb). I knew there was a Google SoC project going on this topic, but I haven't had any updates on it in weeks. Ideally, after all the filesystems are locked, what should happen is to remove all Giant references from the VFS, as in kib's patch on the wiki page. If some filesystem is still left by the 1st of September of next year, it is going to be disconnected from the tree along with the axing of Giant. As the locking of filesystems progresses, we can create subsections for each filesystem including technical notes on the matter. So far there are none, because the effort has not started yet. The page is also meant to contain technical notes on how to approach the locking of filesystems in a more general way. I added the msdosfs example as a reference, but other cases may have different problems. However, as the state of all the filesystems listed in the page is a bit unknown, I'd suggest first making them work stably and only at the end working on locking. Also, please remember that the locking doesn't need to be perfect the first time; it is enough to bring the filesystem out of Giant's influence initially. Of course, for key filesystems (smbfs first and foremost) I'd expect to have full fine-grained locking support at some point. During the 18-month timeframe I'll send reminders and status updates to these lists (monthly or bi-monthly). If there is anything else you want to discuss about this plan, don't hesitate to contact me. There is one last thing I want to stress: this type of activity relies a lot on the audience stepping up and doing the job. Please don't expect someone else to fix the filesystem for you; be as proactive as you can, offering quality time for development, testing and reviews. Thanks, Attilio -- Peace can only be achieved by understanding - A.
Einstein From owner-freebsd-fs@FreeBSD.ORG Sat Aug 27 15:07:19 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2DDD5106564A; Sat, 27 Aug 2011 15:07:19 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id BDA0A8FC0C; Sat, 27 Aug 2011 15:07:18 +0000 (UTC) Received: from deviant.kiev.zoral.com.ua (root@deviant.kiev.zoral.com.ua [10.1.1.148]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id p7RF4UMu001658 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 27 Aug 2011 18:04:30 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.4/8.14.4) with ESMTP id p7RF4Utv034565; Sat, 27 Aug 2011 18:04:30 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.4/8.14.4/Submit) id p7RF4Uqr034564; Sat, 27 Aug 2011 18:04:30 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Sat, 27 Aug 2011 18:04:30 +0300 From: Kostik Belousov To: Attilio Rao Message-ID: <20110827150430.GI17489@deviant.kiev.zoral.com.ua> References: Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="LzERIFExplvR0PTW" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-3.3 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00, DNS_FROM_OPENWHOIS autolearn=no version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: FreeBSD FS , freebsd-current@freebsd.org, Robert Watson , freebsd-arch@freebsd.org Subject: Re: Removal of Giant from the VFS layer for 10.0 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 27 Aug 2011 15:07:19 -0000 On Sat, Aug 27, 2011 at 02:00:50PM +0200, Attilio Rao wrote: > [ Sorry for cross-posting, but I included -arch@ for technical > discussion, -current@ for reaching the wider audience and -fs@ for the > relevance of the matter.] > > Over the last years a lot of effort by several developers has gone > into reducing Giant's influence over the entire kernel. > The VFS layer was no exception, as several major tasks have > been completed over the years, including fine-grained locking for > the vnode lifecycle, fine-grained locking of the VFS structure (mount), > fine-grained locking of specific filesystems (UFS, NFS, etc.) and > several locking improvements to surrounding subsystems (buffer cache, > filedesc objects, VM layer, etc.). > > While FreeBSD has done pretty well so far, a major push is still needed in > order to completely remove Giant from our VFS and buffer cache > subsystems. > At the present time, the biggest problem is that there are still > filesystems which are not properly fine-grained locked, relying on > Giant for assuring atomicity. It is time to make a decision for them,
It is time to make an decision for them, > in order to aim for a Giant-less VFS in our next release. The scope of the project should be made slightly more concrete. If you do not use a non-mpsafe fs, then VFS does not acquire Giant. This is true at least for stable/8 and HEAD kernels, might be also true for stable/7, but I do not remember for sure. The aim of the project is to remove compatibility shims that conditionally acquire Giant on the as-needed basis to allow non-mpsafe filesystems to operate still under the usual locking regime. In other words, the project will not make anything much faster or scalable, but to remove some quite large amount of the crafty code from our VFS, which is, unfortunately, not known for the very clean interfaces. --LzERIFExplvR0PTW Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (FreeBSD) iEYEARECAAYFAk5ZB34ACgkQC3+MBN1Mb4hZtACfQ0FFi0h+ySq6/yqLdaa8TKb1 l7MAnju58Ptqb8WXmYsHvziA3XwRusP/ =yjMg -----END PGP SIGNATURE----- --LzERIFExplvR0PTW-- From owner-freebsd-fs@FreeBSD.ORG Sat Aug 27 17:03:11 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 77EC41065670 for ; Sat, 27 Aug 2011 17:03:11 +0000 (UTC) (envelope-from freebsd-fs@m.gmane.org) Received: from lo.gmane.org (lo.gmane.org [80.91.229.12]) by mx1.freebsd.org (Postfix) with ESMTP id 339908FC16 for ; Sat, 27 Aug 2011 17:03:10 +0000 (UTC) Received: from list by lo.gmane.org with local (Exim 4.69) (envelope-from ) id 1QxMHj-0002X6-44 for freebsd-fs@freebsd.org; Sat, 27 Aug 2011 19:03:07 +0200 Received: from cpe-188-129-77-230.dynamic.amis.hr ([188.129.77.230]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Sat, 27 Aug 2011 19:03:07 +0200 Received: from ivoras by cpe-188-129-77-230.dynamic.amis.hr with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Sat, 27 Aug 2011 19:03:07 +0200 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-fs@freebsd.org From: Ivan Voras Date: Sat, 27 Aug 2011 19:02:44 +0200 Lines: 10 Message-ID: References: <1963980291.20110826232758@serebryakov.spb.ru> <201108262052.p7QKqpen039191@chez.mckusick.com> <758608837.20110827112116@serebryakov.spb.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=windows-1251; format=flowed Content-Transfer-Encoding: 7bit X-Complaints-To: usenet@dough.gmane.org X-Gmane-NNTP-Posting-Host: cpe-188-129-77-230.dynamic.amis.hr User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20110624 Thunderbird/5.0 In-Reply-To: <758608837.20110827112116@serebryakov.spb.ru> Subject: Re: Strange behaviour of UFS2+SU FS on FreeBSD 8-Stable: dreadful perofrmance for old data, excellent for new. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 27 Aug 2011 17:03:11 -0000 On 27.8.2011. 9:21, Lev Serebryakov wrote: > I'm going to investigate alter, why it is ony ~180MiB/s, when > theoretically it should be about (90*4) 360MiB/s linear read, and whom > to blame: UFS or geom_raid5 or both :) Try this: http://ivoras.net/blog/tree/2010-11-19.ufs-read-ahead.html (or it could be a hardware issue - controller bottleneck or something like that).