From owner-freebsd-hackers@FreeBSD.ORG Fri Sep 5 00:29:34 2003 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id EB48416A4BF; Fri, 5 Sep 2003 00:29:33 -0700 (PDT) Received: from stork.mail.pas.earthlink.net (stork.mail.pas.earthlink.net [207.217.120.188]) by mx1.FreeBSD.org (Postfix) with ESMTP id E9F4643FBD; Fri, 5 Sep 2003 00:29:32 -0700 (PDT) (envelope-from tlambert2@mindspring.com) Received: from user-2ivfjg5.dialup.mindspring.com ([165.247.206.5] helo=mindspring.com) by stork.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128) (Exim 3.33 #1) id 19vB25-0005qK-00; Fri, 05 Sep 2003 00:29:26 -0700 Message-ID: <3F583B1F.313B81DE@mindspring.com> Date: Fri, 05 Sep 2003 00:28:31 -0700 From: Terry Lambert X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: Geoff Buckingham References: <3F56F3FD.C636781@mindspring.com> <20030904110155.GA35273@chuggalug.clues.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a461d86e4e29bcea899dae01341e67b08da2d4e88014a4647c350badd9bab72f9c350badd9bab72f9c cc: freebsd-hackers@freebsd.org cc: freebsd-performance@freebsd.org cc: Petri Helenius cc: Max Clark Subject: Re: 20TB Storage System (fsck????) X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 05 Sep 2003 07:29:34 -0000 Geoff Buckingham wrote: > On Thu, Sep 04, 2003 at 01:12:45AM -0700, Terry Lambert wrote: > > Yes. Limit the number of CG bitmaps you examine simultaneously, > > and make the operation multiple pass over the disk. This is not > > that hard a modification to fsck, and it can be done fairly > > quickly by anyone who understands the code. The code in time to > > fsck the disk will go up inversely proportionally to the amount > > of RAM it's allowed to use, which is limited to the UVA size > > minus the fsck program size itself, and the fsck buffers used for > > things like FS metadata for a given file/directory. > > Pardon my ignorance but does the number of inodes in the filesystem have a > significant impact on the memory requirement of fsck? I can't answer empirically, but extrapolating from the empirical data that I *do* have, the time is going to go up proportional to the number of blocks in use, and the number of blocks in use is going to equal the average number of blocks per file times the number of files, and given that there is one inode per file, you will bound the amount of blocks by bounding the number of inodes. This makes the answer "yes, indirectly". What passes get run really depend on how your FS is configured. By default, a background fsck will only check for blocks that are marked as used in the CG bitmaps that are not actually used; so this is a CG bitmap vs. all inodes direct and indirect block lists consistency check only. Most of the incremental or multipass techniques I've discussed on the mailing list assume either a full fsck, or that you are able to lock individual CGs, or at least ranges on the disk, if you wish to do a BG check; or that you read-only the entire disk until you are done, and maintain a list of "needs update" items (this can be very compact, since it can be run-length encoded or otherwise highly compressed). If you read the fsck manual page and understand what it means, you can get some idea of what parameters effect it in what phases: Inconsistencies checked are as follows: 1. Blocks claimed by more than one inode or the free map. Every inode needs to be scanned to see what blocks in the free map should not be in the free map. The "free map" is the set of bits in the set of all cylinder group bitmaps. Cross-checking multiple references for directories is a process of combinatorial math. You take N inodes 2 at a time and compare them. A trick that is possible, if you are willing to rebuild the CG bitmaps in core, or are willing to double the space for the set you are examining would be to examine a range, and at the same time keep a shadow. Zero the shadow, and pass the list of inodes once, setting bits in the shadow. If you go to set a bit and it's already set, *then* you go back and find out who it was who had the bit set. This is probably an OK trade-off, particularly if you maintain a list of "this-file-this-suspect-bit, and then pass the FS again (large numbers of cross-linked blocks are rare). The second of these operations is as expensive as: #inodes_used*(#inodes_used-1)*(#indirect_blocks**2-1) 2. Blocks claimed by an inode outside the range of the filesystem. What this really should say is "which are outside the range". In other words, bogus block numbers. This is a compare that can be made during a direct linear search. 3. Incorrect link counts. This is a directory entry vs. inode count. The expense of this operation depends on whether you are directory-entry-major or on your pass, and the relative number of directory entries vs. inodes. For most FS's, the number of entries is going to be ~15% higher than the number of inodes; this is because of the hard links to directories from their parents, and to parent directories from their child directories. This number could be much, much higher on an FS with a large number of hard links per file. The thing you have to worry about is tracking the number of hard links per inode, and whether you can do this all in memory (e.g. with a linear array of integers of the same type size as the link count, whose length is equal to the number of inodes available in the system), or whether you have to break the job up and pass over the directory structure multiple times. If you can't keep all the items in memory, and must make multiple passes, then it's better to be inode-major; otherwise, it's better to be directory-major. 4. Size checks: Directory size not a multiple of DIRBLKSIZ. Simple check; can be done during one of the single linear passes. Partially truncated file. Also a linear check, but somewhat harder to handle. 5. Bad inode format. Self-inconsistent contents on inodes. 6. Blocks not accounted for anywhere. Every blocks in the free map that's not there and should be, because it's not claimed by a directory or inode. The "free map" is the set of bits in the set of all cylinder group bitmaps. This is the background fsck on an FS with soft updates case. 7. Directory checks: File pointing to unallocated inode. Directory says it's there, inode say's it's not. Linear pass over the directory space, looking up each inode. Inode number out of range. Linear pass over the directory space, looking at each directory entry. Inode is out of range if it's not in the set of inodes per cylinder group times number of cylinder groups. Directories with unallocated blocks (holes). Directories are not allowed to be sparse, since they are accessed via block I/O, linearly, from first byte to last, in order to scan for matches on lookup/create/rename/iteration operations. Dot or dot-dot not the first two entries of a directory or Internal consistency check. having the wrong inode number. Inode number of child and parent do not match expected corresponding values; this is a one element lookahead hierachical traversal, so it's not quite linear; best handled by depth-first recursive descent. 8. Super Block checks: More blocks for inodes than there are in the filesystem. Bad free block map format. Total free block and/or free inode count incorrect. Trivial checks. Free block/free inode are maintained as part of the other checks, so the superblock is kept in core for the duration. > I ask as it was previously stated the smallest file on the 10TB filessytem > would be 500MB which would enable a vastley reduced number of inodes and > possibly very large block fragment and cluster sizes? The thing that matters is the number of allocated inodes, not the number of total inodes. If they aren't allocated, then they don't have blocks allocated to them, and if the don't have blocks allocated to them, then those blocks don't ned to be checked, and they don't need to be checked. Obviously, the above information is just a brief oversimplification, but it gives you a 50,000 foot view of the issues and trade-offs you could make to fit in less RAM. For more detailed information: The UNIX+ File System Check Program http://citeseer.nj.nec.com/31351.html A Fast File System for UNIX (1984) http://citeseer.nj.nec.com/mckusick84fast.html Basically, if you want to learn, you're going to have to read. 8-). -- Terry