From: Bakul Shah <bakul@bitblocks.com>
To: Terry Lambert
Cc: Julian Elischer, FreeBSD current users, fs@FreeBSD.ORG
Subject: Re: Anyone working on fsck?
In-reply-to: Your message of "Mon, 17 Mar 2003 19:41:34 PST."
             <3E76956E.5E0FCC6B@mindspring.com>
Date: Mon, 17 Mar 2003 20:59:45 -0800
Message-ID: <200303180459.XAA15021@goliath.cnchost.com>

> > UFS is the real problem here, not fsck.  Its tradeoffs for
> > improving normal access latencies may have been right in the
> > past but not for modern big disks. ...

> Sorry, but the track-to-track seek latency optimizations you
> are referring to are turned off, given the newfs defaults, and
> have been for a very long time.

I was thinking of the basic idea of cylinder groups: good for a
normal load, not so good for fsck when you have too many CGs.  I
wasn't thinking of what fsck does or does not do.

> Basically, the problem is that for a BG fsck, it's not possible
> to lock access on a cylinder group basis, and then there is a
> hell of a lot of data on that drive.

Note that Julian said 6 hours to fsck a TB in the normal
foreground mode.

> What Julian needs is a Journalling or Log-structured FS.

All that Julian wants is a faster fsck, without mucking with the
FS!  While I agree with you that you do need a full consistency
check, it is worth thinking about how one can avoid that whenever
possible.  For example, if you can work out where the disk head
was at the time of the crash (based on what blocks were being
written), it should be possible to avoid a full check.

> The easiest way to do this is to provide some sort of domain
> control, so that you limit the files you have to check to a
> smaller set of files per CG, so that you don't have to check
> all the files to check a particular CG -- and then preload the
> CG's for the sets of files you *do* have to check.

If you have to visit a CG (during fsck) you have already paid the
cost of the seek and rotational latency.  Journalling wouldn't
help here if you still have a zillion CGs.

> Another way of saying this is... don't put all your space in a
> single FS. 8-).

Or, in effect, treat each CG (or group of CGs) as a self-contained
filesystem (for the purpose of physical allocation) and maintain
explicit import/export lists for files that span them.

> The problem with having to (effectively) read every inode and
> direct block on the disk is really insurmountable, I think.

That is why I was suggesting putting them in one (or a small
number of) contiguous areas.  On a modern ATA100 or better disk
you can read a GB in under a minute.  Once the data is in core you
can divide the checking among multiple processors.

This is sort of like a distributed graph collection: you only need
to worry about graphs that cross a node boundary; most structures
will be contained in one node.  Even for UFS it is probably worth
dividing fsck into two or more processes, one doing IO and one or
more doing computation.
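To make the IO/computation split concrete, something like the
sketch below is what I have in mind.  It is a toy only: the device
name is made up, check_buf() is an empty placeholder, and a real
implementation would seek to the actual inode area (and use shared
memory rather than a pipe), but the shape is the same -- the child
does nothing but sequential reads while the parent burns CPU.

/*
 * Toy sketch: split the scan into a reader process and a checking
 * process.  Hypothetical device path; check_buf() is a stub.
 */
#include <sys/types.h>
#include <sys/wait.h>
#include <err.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK	(1024 * 1024)		/* 1 MB per transfer */

/* Stub: a real version would validate the inodes in buf. */
static void
check_buf(const char *buf, ssize_t n)
{
	(void)buf;
	(void)n;
}

int
main(void)
{
	const char *dev = "/dev/ad0s1e";	/* made-up device */
	char *buf;
	ssize_t n;
	int pfd[2];
	pid_t pid;

	if ((buf = malloc(CHUNK)) == NULL || pipe(pfd) == -1)
		err(1, "setup");
	if ((pid = fork()) == -1)
		err(1, "fork");

	if (pid == 0) {				/* child: pure IO */
		int fd;

		close(pfd[0]);
		if ((fd = open(dev, O_RDONLY)) == -1)
			err(1, "%s", dev);
		/* stream the (assumed contiguous) inode area in order */
		while ((n = read(fd, buf, CHUNK)) > 0)
			if (write(pfd[1], buf, n) != n)
				err(1, "write");
		_exit(0);
	}

	close(pfd[1]);				/* parent: computation */
	while ((n = read(pfd[0], buf, CHUNK)) > 0)
		check_buf(buf, n);
	waitpid(pid, NULL, 0);
	return (0);
}

The point is only that the reader never waits on computation and
the checker rarely waits on the disk; on a multiprocessor you can
fork several checkers and deal buffers out to them.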
> > Typically only a few cyl grps may be inconsistent in case of
> > a crash.  Maybe some info about which cyl groups need to be
> > checked can be stored so that brute force checking of all
> > grps can be avoided.

> This would work well... under laboratory test conditions.  In
> the field, if the machine has a more general workload (or even
> a heterogeneous workload, but with a hell of a lot of files,
> like a big mail server), this falls down, as the number of bits
> marking "unupdated cylinder groups" becomes large.

Possible -- it is one of the ideas I can think of.  I'd have to
actually model or simulate it, beyond handwaving, to know one way
or the other.  It may be useful in conjunction with other ideas.

> ...AND... The problem is still that you must scan everything on
> the disk (practically) to identify the inode or indirect block
> that references the block on the cylinder group in question, and
> THAT's the problem.  If you knew a small set of CGs that needed
> checking, vs. "all of them", it would *still* bust the cache,
> which is what takes all the time.
> Assume, on average, each file goes into indirect blocks.

On my machine the average file size is 21KB (averaged over
4,000,000 files).  With an 8KB blocksize the twelve direct block
pointers already cover 96KB, so very few files will have indirect
blocks at all.  I don't know how typical my file size distribution
is, but I suspect the typical case has even smaller files than
mine (I store lots of datasheets, manuals, databases, PDFs, MP3s,
cvs repositories, and compressed tars of old stuff).

But in any case, wouldn't going forward from the inodes make more
sense?  This is like a standard tracing garbage collection
algorithm: blocks that are not reachable are free.  Even for a
1 TB system with 8K blocks you need only a 2^(40-13-3) == 16 Mbyte
bitmap (2^27 blocks, one bit each), or some small multiple of that
if you want more than one bit of state per block.  A sketch of the
mark phase is appended at the end of this message.

> The problem is reverse mapping a bit in a CG bitmap to the file
> that references it... 8^p.

Why would you want to do that?!

> Multithreading fsck would be an incredibly bad idea.

It depends on the actual algorithm.

> Personally, I recommend using a different FS, if you are going
> to create a honking big FS as a single partition. 8-(.

There are other issues with smaller partitions.  I'd rather have a
single logical file system where all the space can be used; if it
is physically implemented as a number of smaller filesystems, that
is okay.  Also note that people can now create big honking files:
video streaming runs about 2GB per hour, and even then you are
compromising a bit on quality.
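To put some flesh on the tracing-collection analogy, here is the
promised sketch of the mark phase, using the sizes from above
(2^40 bytes, 2^13-byte blocks, so 2^27 blocks and a 2^24-byte ==
16 MB bitmap).  mark_inode() and the 3-block file at block 1000
are made-up stand-ins for walking real inode block pointers.

/*
 * Mark phase sketch: one reachability bit per 8K block of a 1 TB
 * filesystem.  The traversal is a stub.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BLKSHIFT	13			/* 8 KB blocks */
#define NBLOCKS		(1ULL << (40 - BLKSHIFT))

static uint8_t *reachable;			/* 16 MB bitmap */

static void
mark(uint64_t b)
{
	reachable[b >> 3] |= 1 << (b & 7);
}

static int
is_marked(uint64_t b)
{
	return (reachable[b >> 3] >> (b & 7)) & 1;
}

/* Stand-in for tracing one inode's direct/indirect pointers. */
static void
mark_inode(uint64_t first, uint64_t nblk)
{
	while (nblk-- > 0)
		mark(first++);
}

int
main(void)
{
	if ((reachable = calloc(NBLOCKS / 8, 1)) == NULL)
		return (1);

	/* Mark: trace every inode; here, one made-up 3-block file. */
	mark_inode(1000, 3);

	/*
	 * Sweep: a block allocated in a CG bitmap but unmarked here
	 * is leaked; marked but free in the CG bitmap is worse.
	 */
	printf("block 1001 reachable: %d\n", is_marked(1001));
	free(reachable);
	return (0);
}

With a second bit of state per block you can also catch blocks
claimed by two files, and the mark phase splits naturally across
processes if you deal out the inodes -- which is why I say whether
multithreading fsck is a bad idea depends on the algorithm.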