From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 27 14:01:57 2005
From: Dag-Erling Smørgrav <des@des.no>
Date: Sun, 27 Mar 2005 16:01:11 +0200
To: Scott Long
Cc: David Malone, David Schultz, freebsd-fs@freebsd.org
Subject: Re: UFS Subdirectory limit.

Scott Long writes:
> It would be much more worthwhile to introduce a UFS3 that uses a more
> efficient directory layout (B-tree?) to provide real value to
> increasing the nlink limitation.

It would be even more worthwhile to simply adopt an existing
well-designed and well-tested file system, such as IBM JFS.

DES
--
Dag-Erling Smørgrav - des@des.no

From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 27 15:32:09 2005
From: Robert Watson <robert@fledge.watson.org>
Date: Sun, 27 Mar 2005 15:29:02 +0000 (GMT)
To: David Malone
Cc: freebsd-fs@freebsd.org
Subject: Re: UFS Subdirectory limit.

On Sat, 26 Mar 2005, David Malone wrote:

> > Also, the more important concern is that large directories simply
> > don't scale in UFS.  Lookups are a linear operation, and while
> > DIRHASH helps, it really doesn't scale well to 150k entries.
>
> It seems to work passably well actually, not that I've benchmarked it
> carefully at this size.  My junkmail maildir has 164953 entries at the
> moment, and is pretty much continuously appended to without creating
> any problems for the machine it lives on.  Dirhash doesn't care whether
> the entries are subdirectories or files.
>
> If the directory entries are largely static, the name cache should do
> all the work, and it is well capable of dealing with lots of files.
>
> We should definitely look at what sort of filesystem features we're
> likely to need in the future, but I just wanted to see if we can offer
> people a solution that doesn't mean waiting for FreeBSD 6 or 7.

FWIW, I regularly use directories with several hundred thousand files in
them, and it works quite well for the set of operations I perform
(typically, I only append new entries to the directory).  This is with a
Cyrus server hosting fairly large shared folders -- in Cyrus, a
maildir-like format is used.  For example, the lists.linux.kernel
directory references 430,000 individual files.  Between UFS_DIRHASH and
Cyrus's use of a cache file, opening the folder primarily consists of
mmap'ing the cache file and then doing lookups, which occur quite
quickly.  My workload doesn't currently require large numbers of
directories referenced by a single directory, but based on the results
I've had with large numbers of files, I expect it would work fine,
subject to UFS being able to express it.

Robert N M Watson
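A minimal sketch of the approach Robert describes above: mmap a
per-folder cache file and scan it, instead of performing hundreds of
thousands of directory operations when a folder is opened.  The
fixed-size record layout below is a hypothetical stand-in, not Cyrus's
actual cyrus.cache format.

/*
 * Open a hypothetical per-folder cache file, map it, and walk the
 * records sequentially.  "Opening the folder" touches one file and one
 * mapping rather than stat(2)'ing every message in a 400k-entry
 * directory; the directory itself only needs to be consulted when an
 * individual message body is fetched.
 */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

struct cache_rec {              /* hypothetical fixed-size record */
	uint32_t uid;           /* message number; "<uid>." is the file name */
	uint32_t size;          /* message size in bytes */
	uint64_t internaldate;  /* delivery time */
};

int
main(int argc, char **argv)
{
	struct cache_rec *recs;
	struct stat st;
	size_t i, n;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s cachefile\n", argv[0]);
		return (1);
	}
	if ((fd = open(argv[1], O_RDONLY)) == -1) {
		perror("open");
		return (1);
	}
	if (fstat(fd, &st) == -1) {
		perror("fstat");
		return (1);
	}
	recs = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (recs == MAP_FAILED) {
		perror("mmap");
		return (1);
	}
	n = (size_t)st.st_size / sizeof(*recs);
	for (i = 0; i < n; i++)
		printf("uid %u size %u\n", (unsigned)recs[i].uid,
		    (unsigned)recs[i].size);
	munmap(recs, (size_t)st.st_size);
	close(fd);
	return (0);
}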
From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 27 16:08:32 2005
From: Poul-Henning Kamp <phk@critter.freebsd.dk>
Date: Sun, 27 Mar 2005 18:08:26 +0200
To: Dag-Erling Smørgrav <des@des.no>
Cc: David Malone, David Schultz, freebsd-fs@freebsd.org
Subject: Re: UFS Subdirectory limit.

In message <868y49w5lk.fsf@xps.des.no>, Dag-Erling Smørgrav writes:

> Scott Long writes:
> > It would be much more worthwhile to introduce a UFS3 that uses a more
> > efficient directory layout (B-tree?) to provide real value to
> > increasing the nlink limitation.
>
> It would be even more worthwhile to simply adopt an existing
> well-designed and well-tested file system, such as IBM JFS.

Even better: do both, and then on top of that encourage further research
into filesystems which take modern usage of computing systems into
account :-)

--
Poul-Henning Kamp    | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG      | TCP/IP since RFC 956
FreeBSD committer    | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 27 18:39:01 2005
From: Scott Long <scottl@samsco.org>
Date: Mon, 28 Mar 2005 02:42:55 -0700
To: Robert Watson
Cc: David Malone, freebsd-fs@FreeBSD.org
Subject: Re: UFS Subdirectory limit.

Robert Watson wrote:
> On Sat, 26 Mar 2005, David Malone wrote:
>
> > > Also, the more important concern is that large directories simply
> > > don't scale in UFS.  Lookups are a linear operation, and while
> > > DIRHASH helps, it really doesn't scale well to 150k entries.
> >
> > It seems to work passably well actually, not that I've benchmarked it
> > carefully at this size.  My junkmail maildir has 164953 entries at the
> > moment, and is pretty much continuously appended to without creating
> > any problems for the machine it lives on.  Dirhash doesn't care whether
> > the entries are subdirectories or files.
> >
> > If the directory entries are largely static, the name cache should do
> > all the work, and it is well capable of dealing with lots of files.
> >
> > We should definitely look at what sort of filesystem features we're
> > likely to need in the future, but I just wanted to see if we can offer
> > people a solution that doesn't mean waiting for FreeBSD 6 or 7.
>
> FWIW, I regularly use directories with several hundred thousand files in
> them, and it works quite well for the set of operations I perform
> (typically, I only append new entries to the directory).  This is with a
> Cyrus server hosting fairly large shared folders -- in Cyrus, a
> maildir-like format is used.  For example, the lists.linux.kernel
> directory references 430,000 individual files.  Between UFS_DIRHASH and
> Cyrus's use of a cache file, opening the folder primarily consists of
> mmap'ing the cache file and then doing lookups, which occur quite
> quickly.  My workload doesn't currently require large numbers of
> directories referenced by a single directory, but based on the results
> I've had with large numbers of files, I expect it would work fine,
> subject to UFS being able to express it.
>
> Robert N M Watson

Luckily, linear reads through a directory are nearly O(1) in UFS, since
ufs_lookup() caches the offset of the last entry read so that a
subsequent call doesn't have to start from the beginning.  I suspect
that this, along with the namei cache, DIRHASH, and Cyrus's own cache,
all contribute to making reads of the spool directories non-painful for
you.  I would also suspect that there is little manual sorting going on,
since Cyrus chooses names for new entries that are naturally sorted.
I'm still not sure I would consider these behaviours representative of
the norm, though.  It would be quite interesting to profile the system
while Cyrus appends or deletes a mail file in one of your large spool
directories.  Would an application that isn't as well written as Cyrus
behave as well?  What about an application like Squid?

Scott
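The near-O(1) sequential behaviour Scott mentions comes from
ufs_lookup() remembering where the previous lookup in a directory
finished (the i_diroff hint in the in-core inode).  The toy below
illustrates the effect of such a hint over an in-memory list of names;
it is a sketch of the idea only, not the actual UFS code.

/*
 * Toy model of a directory lookup that remembers the offset where the
 * previous lookup ended and starts the next scan there, wrapping
 * around once.  Lookups issued in directory order cost ~1 comparison
 * each; an unfriendly order degrades toward a full scan per lookup.
 */
#include <stdio.h>
#include <string.h>

#define NENTRIES 10000          /* kept small so the worst case is quick */

static char names[NENTRIES][16];
static int last_off;            /* analogue of the per-inode offset hint */
static long compares;

static int
lookup(const char *name)
{
	int i, n;

	for (n = 0; n < NENTRIES; n++) {
		i = (last_off + n) % NENTRIES;
		compares++;
		if (strcmp(names[i], name) == 0) {
			last_off = (i + 1) % NENTRIES;  /* remember for next time */
			return (i);
		}
	}
	return (-1);
}

int
main(void)
{
	char name[16];
	int i;

	for (i = 0; i < NENTRIES; i++)
		snprintf(names[i], sizeof(names[i]), "f%07d", i);

	/* Lookups in directory order: about one comparison each. */
	compares = 0;
	for (i = 0; i < NENTRIES; i++) {
		snprintf(name, sizeof(name), "f%07d", i);
		lookup(name);
	}
	printf("in order: %.1f compares/lookup\n",
	    (double)compares / NENTRIES);

	/* Reverse order defeats the hint: nearly a full scan each time. */
	compares = 0;
	for (i = NENTRIES - 1; i >= 0; i--) {
		snprintf(name, sizeof(name), "f%07d", i);
		lookup(name);
	}
	printf("reverse:  %.1f compares/lookup\n",
	    (double)compares / NENTRIES);
	return (0);
}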
From owner-freebsd-fs@FreeBSD.ORG  Sun Mar 27 20:45:11 2005
From: David Malone <dwmalone@maths.tcd.ie>
Date: Sun, 27 Mar 2005 21:45:06 +0100
To: Scott Long
Cc: Robert Watson, freebsd-fs@FreeBSD.org
Subject: Re: UFS Subdirectory limit.

> Luckily, linear reads through a directory are nearly O(1) in UFS, since
> ufs_lookup() caches the offset of the last entry read so that a
> subsequent call doesn't have to start from the beginning.

(Dirhash also has an equivalent optimisation, because that bit of the
ufs_lookup() code isn't called when dirhash is in use.)

> Would an application that isn't as well written as Cyrus behave as
> well?  What about an application like Squid?

Random lookups should be almost O(1) with dirhash when you have many
operations over which to amortise the cost of building the hash.  Where
you lose out with dirhash is when you make a small number of accesses
to a large directory and all of those entries live close to the
beginning of the directory (or possibly when you're thrashing against
dirhash's memory limit).  If the directory entries are actually
constant (as is the case with Squid in truncate mode), then you should
get ~O(1), but with a slightly smaller constant than when the directory
entries are changing.

Just to check, I'm running a benchmark at the moment to compare 150k
directories arranged either as:

  1) a flat structure of 150k subdirectories of one directory, or
  2) two levels with ~387 subdirectories at each level.

At the moment it looks like accessing files in either structure
performs equivalently, but it is a bit slower to build and remove the
flat structure.  I'll post the results once the run is complete.

	David.
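For reference, the two layouts being compared look roughly like the
mapping below; 387 is about the square root of 150000, so the two-level
tree holds ~387 entries per directory.  The naming scheme is a guess at
the shape of the test, not David's actual benchmark code.

/*
 * Map a key in [0, 150000) to a path in either layout:
 *   flat:  one directory holding 150000 subdirectories
 *   sqrt:  two levels of ~387 entries each (387 ~= sqrt(150000))
 * With 150000 keys the top level actually ends up with 388 entries,
 * which is close enough for the comparison.
 */
#include <stdio.h>

#define FANOUT 387

static void
flat_path(char *buf, size_t len, int key)
{
	snprintf(buf, len, "flat/d%06d", key);
}

static void
sqrt_path(char *buf, size_t len, int key)
{
	snprintf(buf, len, "sqrt/%03d/d%03d", key / FANOUT, key % FANOUT);
}

int
main(void)
{
	int samples[] = { 0, 386, 387, 149999 };
	char p[64];
	size_t i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		flat_path(p, sizeof(p), samples[i]);
		printf("%-16s", p);
		sqrt_path(p, sizeof(p), samples[i]);
		printf("  %s\n", p);
	}
	return (0);
}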
From owner-freebsd-fs@FreeBSD.ORG  Mon Mar 28 04:48:57 2005
From: Julian Elischer <julian@elischer.org>
Date: Sun, 27 Mar 2005 20:48:44 -0800
To: Poul-Henning Kamp, freebsd-fs@freebsd.org
Subject: Re: UFS Subdirectory limit.

Poul-Henning Kamp wrote:
> In message <20050326213048.GA33703@VARK.MIT.EDU>, David Schultz writes:
>
> > On Fri, Mar 25, 2005, Scott Long wrote:
> > > David Schultz wrote:
> > > > On Sat, Mar 26, 2005, David Malone wrote:
> > > > > There was a discussion on comp.unix.bsd.freebsd.misc about two
> > > > > weeks ago, where someone had an application that used about 150K
> > > > > subdirectories of a single directory.  They wanted to move this
> > > > > application to FreeBSD, but discovered that UFS is limited to 32K
> > > > > subdirectories, because UFS's link count field is a signed 16-bit
> > > > > quantity.  Rewriting the application wasn't an option for them.
>
> Has anybody here wondered how much searching a 150K directory would
> suck, performance-wise?
>
> I realize that with dir-hashing and the vfs cache it is not as bad as
> it used to be, but I still think it will be unpleasant performance-wise.

We have a reason (*) to have 300000 entries in a directory.  Once the
dirhash cache size was made big enough, performance was acceptable.

(*) We didn't want to, but had to, for "a while until it's fixed".

From owner-freebsd-fs@FreeBSD.ORG  Mon Mar 28 06:30:51 2005
From: Don Lewis <truckman@FreeBSD.org>
Date: Sun, 27 Mar 2005 22:30:42 -0800 (PST)
To: rwatson@FreeBSD.org
Cc: dwmalone@maths.tcd.ie, freebsd-fs@FreeBSD.org
Subject: Re: UFS Subdirectory limit.

On 27 Mar, Robert Watson wrote:

> FWIW, I regularly use directories with several hundred thousand files in
> them, and it works quite well for the set of operations I perform
> (typically, I only append new entries to the directory).  This is with a
> Cyrus server hosting fairly large shared folders -- in Cyrus, a
> maildir-like format is used.  For example, the lists.linux.kernel
> directory references 430,000 individual files.  Between UFS_DIRHASH and
> Cyrus's use of a cache file, opening the folder primarily consists of
> mmap'ing the cache file and then doing lookups, which occur quite
> quickly.

I'm doing the same here, and performance is OK for day-to-day
operations, but cloning the mail store is glacially slow.  I had to
move my mail store from a dying disk to a replacement disk, and
migrating about 5GB of data took the better part of a day.  As the
destination disk started to fill up, the I/O rates dropped to very low
levels and the CPU was pegged at close to 100%, mostly system time.

The problem is that the allocation strategy of locating file inodes and
their data blocks in the same cylinder group as their parent directory
fails badly with directories of this size.  Once the cylinder group
fills, the search for free inodes and blocks gets very slow.  Using
bigger cylinder groups would help a lot, especially considering disk
sizes these days, but both UFS1 and UFS2 have inconveniently small
maximum cylinder group sizes.  I think it would make sense to be able
to cluster groups of cylinder groups together for allocation purposes.
Something else that would seem to make a lot of sense would be to
implement something like the maxbpg limit for large directories, which
would force the inode allocator to start allocating from another
cylinder group after a directory grows past a certain size.

> My workload doesn't currently require large numbers of directories
> referenced by a single directory, but based on the results I've had
> with large numbers of files, I expect it would work fine, subject to
> UFS being able to express it.
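The maxbpg-style policy Don suggests can be sketched as: prefer the
parent directory's cylinder group, but once a directory has already
drawn a threshold of allocations from it, spill new allocations to
later groups.  The toy below only illustrates that policy; it is not
how ffs_valloc() or dirpref actually work, and all the sizes are made
up.

/*
 * Toy allocator: every DIR_SPILL allocations made on behalf of one
 * directory, move on to the next cylinder group, so a single huge
 * directory cannot exhaust the free inodes near its parent.
 */
#include <stdio.h>

#define NCG        64           /* number of cylinder groups */
#define IPG        8192         /* inodes per group */
#define DIR_SPILL  2048         /* hypothetical per-directory limit per CG */

struct cg { int ifree; };

static struct cg cgs[NCG];

/*
 * Allocate an inode for a file whose parent directory lives in
 * cylinder group "pcg" and has already allocated "dir_alloced" inodes.
 */
static int
alloc_inode(int pcg, long dir_alloced)
{
	int start, cg, n;

	start = (pcg + (int)(dir_alloced / DIR_SPILL)) % NCG;
	for (n = 0; n < NCG; n++) {
		cg = (start + n) % NCG;
		if (cgs[cg].ifree > 0) {
			cgs[cg].ifree--;
			return (cg);
		}
	}
	return (-1);            /* out of inodes */
}

int
main(void)
{
	long counts[NCG] = { 0 };
	long i;
	int cg;

	for (cg = 0; cg < NCG; cg++)
		cgs[cg].ifree = IPG;

	/* One huge directory in cylinder group 0 creating 100000 files. */
	for (i = 0; i < 100000; i++) {
		cg = alloc_inode(0, i);
		if (cg >= 0)
			counts[cg]++;
	}
	for (cg = 0; cg < NCG; cg++)
		if (counts[cg])
			printf("cg %2d: %ld inodes\n", cg, counts[cg]);
	return (0);
}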
From owner-freebsd-fs@FreeBSD.ORG  Mon Mar 28 15:35:09 2005
From: David Malone <dwmalone@maths.tcd.ie>
Date: Mon, 28 Mar 2005 16:35:06 +0100
To: Scott Long
Cc: Robert Watson, freebsd-fs@FreeBSD.org
Subject: Re: UFS Subdirectory limit.

Here are the benchmark results comparing a two-level scheme (which I've
labeled "sqrt") with a single directory of 150000 subdirectories (which
I've labeled "flat").

The benchmark has 4 phases:

  mkdir) This builds the directory structure.
  write) This writes a small amount of data into 100000 files in a
         pseudo-random sequence of subdirectories.
  read)  This reads back the data from each of the 100000 files (in
         the same order they were written).
  rm)    This does an "rm -fr" of the whole tree.

I just used /usr/bin/time on each phase and synced out the data between
phases.  The results are averaged over 4 runs (see the end of the mail
for the output of ministat on the data):

              real time           |       user time        |       sys time
      mkdir write read   rm       | mkdir write read   rm  | mkdir write read    rm
sqrt    499  4302 2409 1569       |  1.84  1.94 1.72 1.69  |  29.9  33.5 21.3 161.6
flat   1172  4318 2407 1717       |  1.47  1.62 1.52 1.66  |  26.1  33.5 20.6 158.1

So, it seems that while making the directory structure takes a bit
longer for the flat method, there's no significant penalty in real time
for using it.  The user times are pretty irrelevant (though the flat
scheme is slightly faster, probably because some of the phases don't do
sqrts ;-).

Interestingly, the system times for the flat structure are actually
*better* than for the two-level structure!  I think this supports Don's
suggestion that the layout of data on the disk with very large
directories is not as good as it could be.

(The test was done on an amd64 machine with gobs of RAM.  I used my
patch to get large directories, which saves a metadata operation per
mkdir and rmdir, even in the sqrt case.  I upped the amount of memory
available to dirhash, though it didn't actually use more than about
2.5MB during the benchmark.  Maxvnodes is set to 100000, so 150K
directories plus 100K files should be enough to make the name cache and
vnode cache work hard.)

	David.

[ministat output follows; the ASCII distribution plots are omitted]

x sqrt-real-mkdir
+ flat-real-mkdir
    N        Min        Max     Median        Avg     Stddev
x   4     469.16     527.83     513.14   499.2525  26.256259
+   4    1157.18    1182.42    1176.14  1172.1225  10.737021
Difference at 95.0% confidence
        672.87 +/- 34.7068
        134.775% +/- 6.95175%
        (Student's t, pooled s = 20.0583)

x sqrt-real-write
+ flat-real-write
    N        Min        Max     Median        Avg     Stddev
x   4    4286.79    4317.17    4305.92  4302.0775  12.774457
+   4    4288.35    4339.77    4330.78   4318.195  22.606829
No difference proven at 95.0% confidence

x sqrt-real-read
+ flat-real-read
    N        Min        Max     Median        Avg     Stddev
x   4    2404.34     2417.4    2411.22  2409.3875  6.2196322
+   4    2396.65    2417.16    2411.18    2407.08  8.9677905
No difference proven at 95.0% confidence

x sqrt-real-rm
+ flat-real-rm
    N        Min        Max     Median        Avg     Stddev
x   4    1562.17    1578.65    1572.48  1568.9625  8.0307466
+   4    1707.86    1722.16    1721.65  1717.3925  6.6327841
Difference at 95.0% confidence
        148.43 +/- 12.7436
        9.46039% +/- 0.812231%
        (Student's t, pooled s = 7.36501)

x sqrt-user-mkdir
+ flat-user-mkdir
    N        Min        Max     Median        Avg       Stddev
x   4       1.81       1.87       1.85       1.84  0.025819889
+   4       1.32        1.6       1.48       1.47   0.11489125
Difference at 95.0% confidence
        -0.37 +/- 0.144075
        -20.1087% +/- 7.83019%
        (Student's t, pooled s = 0.0832666)

x sqrt-user-write
+ flat-user-write
    N        Min        Max     Median        Avg       Stddev
x   4       1.82       2.06       1.95      1.935  0.099498744
+   4       1.57       1.66       1.63       1.62  0.037416574
Difference at 95.0% confidence
        -0.315 +/- 0.13006
        -16.2791% +/- 6.72144%
        (Student's t, pooled s = 0.0751665)

x sqrt-user-read
+ flat-user-read
    N        Min        Max     Median        Avg      Stddev
x   4       1.56       1.81       1.77       1.72  0.11045361
+   4       1.33       1.71       1.57      1.515  0.16278821
No difference proven at 95.0% confidence

x sqrt-user-rm
+ flat-user-rm
    N        Min        Max     Median        Avg      Stddev
x   4        1.4       1.89       1.84      1.695  0.22218611
+   4       1.54        1.8       1.74      1.665  0.12476645
No difference proven at 95.0% confidence

x sqrt-sys-mkdir
+ flat-sys-mkdir
    N        Min        Max     Median        Avg      Stddev
x   4      29.62      30.15      30.06      29.89  0.25495098
+   4       25.7      26.84      26.07       26.1   0.5178803
Difference at 95.0% confidence
        -3.79 +/- 0.706247
        -12.6798% +/- 2.36282%
        (Student's t, pooled s = 0.408167)

x sqrt-sys-write
+ flat-sys-write
    N        Min        Max     Median        Avg      Stddev
x   4      33.21       33.8      33.57    33.5275  0.24281337
+   4      32.81      34.25      33.83     33.565  0.61846584
No difference proven at 95.0% confidence

x sqrt-sys-read
+ flat-sys-read
    N        Min        Max     Median        Avg      Stddev
x   4      20.94      21.97      21.27      21.33  0.44773504
+   4      20.33         21      20.71    20.6325  0.29033027
Difference at 95.0% confidence
        -0.6975 +/- 0.652893
        -3.27004% +/- 3.06092%
        (Student's t, pooled s = 0.377332)

x sqrt-sys-rm
+ flat-sys-rm
    N        Min        Max     Median        Avg     Stddev
x   4     138.41     175.09     168.91     161.61  16.115177
+   4     141.94     170.84      164.4    158.175  12.513687
No difference proven at 95.0% confidence
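A compressed sketch of the four phases described above (mkdir, write,
read, rm), run here against the flat layout and scaled down so it
finishes quickly.  The file sizes, names and pseudo-random sequence are
assumptions about the shape of the test, not David's actual harness;
timing each phase with /usr/bin/time, as he did, works just as well as
the gettimeofday() calls used here.

/*
 * Tiny flat-layout benchmark: build NDIRS subdirectories, scatter
 * NFILES small files among them in a pseudo-random order, read them
 * back in the same order, then remove the tree.
 */
#include <sys/stat.h>
#include <sys/time.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NDIRS  1500             /* scaled down from 150000 */
#define NFILES 1000             /* scaled down from 100000 */

static double
now(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (tv.tv_sec + tv.tv_usec / 1e6);
}

int
main(void)
{
	char path[128];
	char buf[64] = "some small amount of data\n";
	double t;
	int fd, i;

	/* mkdir phase: one flat directory of NDIRS subdirectories. */
	t = now();
	mkdir("flat", 0755);
	for (i = 0; i < NDIRS; i++) {
		snprintf(path, sizeof(path), "flat/d%06d", i);
		mkdir(path, 0755);
	}
	printf("mkdir: %.2fs\n", now() - t);

	/* write phase: small files in a pseudo-random subdirectory order. */
	t = now();
	srandom(42);
	for (i = 0; i < NFILES; i++) {
		snprintf(path, sizeof(path), "flat/d%06ld/f%06d",
		    random() % NDIRS, i);
		if ((fd = open(path, O_WRONLY | O_CREAT, 0644)) >= 0) {
			write(fd, buf, sizeof(buf));
			close(fd);
		}
	}
	printf("write: %.2fs\n", now() - t);

	/* read phase: replay the same sequence and read each file back. */
	t = now();
	srandom(42);
	for (i = 0; i < NFILES; i++) {
		snprintf(path, sizeof(path), "flat/d%06ld/f%06d",
		    random() % NDIRS, i);
		if ((fd = open(path, O_RDONLY)) >= 0) {
			read(fd, buf, sizeof(buf));
			close(fd);
		}
	}
	printf("read:  %.2fs\n", now() - t);

	/* rm phase: hand the cleanup to rm, as the original test did. */
	t = now();
	system("rm -fr flat");
	printf("rm:    %.2fs\n", now() - t);
	return (0);
}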
From owner-freebsd-fs@FreeBSD.ORG  Mon Mar 28 20:03:26 2005
From: Don Lewis <truckman@FreeBSD.org>
Date: Mon, 28 Mar 2005 12:03:16 -0800 (PST)
To: dwmalone@maths.tcd.ie
Cc: rwatson@FreeBSD.org, freebsd-fs@FreeBSD.org
Subject: Re: UFS Subdirectory limit.

On 28 Mar, David Malone wrote:

> Here are the benchmark results comparing a two-level scheme (which I've
> labeled "sqrt") with a single directory of 150000 subdirectories (which
> I've labeled "flat").
>
> The benchmark has 4 phases:
>
>   mkdir) This builds the directory structure.
>   write) This writes a small amount of data into 100000 files in a
>          pseudo-random sequence of subdirectories.
>   read)  This reads back the data from each of the 100000 files (in
>          the same order they were written).
>   rm)    This does an "rm -fr" of the whole tree.
>
> I just used /usr/bin/time on each phase and synced out the data between
> phases.  The results are averaged over 4 runs (see the end of the mail
> for the output of ministat on the data):
>
>               real time           |       user time        |       sys time
>       mkdir write read   rm       | mkdir write read   rm  | mkdir write read    rm
> sqrt    499  4302 2409 1569       |  1.84  1.94 1.72 1.69  |  29.9  33.5 21.3 161.6
> flat   1172  4318 2407 1717       |  1.47  1.62 1.52 1.66  |  26.1  33.5 20.6 158.1
>
> So, it seems that while making the directory structure takes a bit
> longer for the flat method, there's no significant penalty in real time
> for using it.  The user times are pretty irrelevant (though the flat
> scheme is slightly faster, probably because some of the phases don't do
> sqrts ;-).
>
> Interestingly, the system times for the flat structure are actually
> *better* than for the two-level structure!  I think this supports Don's
> suggestion that the layout of data on the disk with very large
> directories is not as good as it could be.

Just for grins, you might want to try a "very-flat" experiment where
you create all 100000 files in the top directory.

Traditionally, directories were always allocated in a different
cylinder group than their parent, which would spread them all over the
disk.  This turns out to be somewhat sub-optimal because it causes an
excessive amount of seek activity when traversing large directory
trees.  When the dirpref code was added, it allowed a limited number of
subdirectories to be allocated in the same cylinder group as their
parent, but I suspect that the allocations will still be fairly well
distributed when running your benchmark.
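The "very-flat" experiment Don proposes would simply drop the
subdirectory level: something along these lines, with the naming scheme
again an assumption rather than anyone's actual test code.

/* Create all 100000 files directly in one top-level directory. */
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define NFILES 100000

int
main(void)
{
	char path[64];
	int fd, i;

	mkdir("veryflat", 0755);
	for (i = 0; i < NFILES; i++) {
		snprintf(path, sizeof(path), "veryflat/f%06d", i);
		if ((fd = open(path, O_WRONLY | O_CREAT, 0644)) >= 0) {
			write(fd, "x", 1);
			close(fd);
		}
	}
	return (0);
}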