Date: Sat, 29 Nov 1997 21:38:36 -0800 (PST)
From: Julian Elischer
To: hackers@freebsd.org
cc: Julian Elischer
Subject: Stackable storage Alpha release

Over the last couple of years I have been slowly trying to increase
the modularity of FreeBSD. One of the things that I really didn't like
about UNIX when I first started using it was the 'disconnection'
between the contents of /dev and reality. To this end I have been
working in the background on DEVFS, a device filesystem which allows
(in fact requires) the device drivers to keep the exported picture of
available devices in sync with what is actually attached.

DEVFS has had its ups and downs, but one of the difficulties it has
had is in dealing with the current idea of slices and partitions,
particularly the way in which slices and devices are all mixed
together. I finally gave up, and have spent the last 4 weeks or so
rewriting the disk storage system. This comes from discussions I have
had with PHK at TFS, and others (e.g. Peter Wemm in Perth) over the
last few years.

Redoing this, which is so basic to the system, of course requires that
many things be changed.
A lot of the changes turn out to be clean-ups. For example, the code
for interpreting the boot device as handed in by the bootblocks could
be a lot cleaner. Mounting the root filesystem is in general a messy
business in FreeBSD, and a general cleanup there might make things
easier to fix in the future.

I now have a set of sample code and patches for FreeBSD-current which
allow the system to run on a DEVFS, using a primitive version of the
rewritten storage code. Anyone interested can get a copy of the
changes from hub.freebsd.org in:

    ftp://hub.freebsd.org/pub/scsi/slice.tar.gz

Unpack the tar file in /sys to get all the new files, then apply the
patch slicediff that it leaves in /sys to pick up the changes to
existing files.

This is very early code. It can however run on most systems that have
SCSI or IDE drives (as long as bad144 is not used). There are the
following points to be made:

1/ I have yet to integrate a whole bunch of work that phk has done on
   this, as I elected to get to a booting and running stage before I
   did that.

2/ This code will not support old ESDI drives that cannot report their
   geometry. (There is support for it, but it is unfinished, and a
   change in direction is under way after a discussion with Mike
   Smith.)

3/ You need to change all the entries in /etc/fstab to use their
   CANONICAL names, e.g. sd1s1a rather than sd1a. There should be an
   entry of the form:

       devfs /dev devfs rw 1 1

   possibly BEFORE root.

4/ As I write this you need to boot single user, and manually do a

       mount /dev
       fsck -p
       ^D

   I'm not yet sure why. Rather than just proceeding into multi-user
   mode, I would suggest trying out your devices in single-user mode
   anyway.

5/ There is a file i386/isa/ide.c. This is wd.c with all the old code
   removed, and some cleanups. I did this just to see how much
   difference removing all the old stuff made.

6/ The SCSI disk can still be accessed through the old interface in
   parallel with the new interface. The IDE disk cannot.
   If you boot with the root fs mounted from a DEVFS device, you will
   not be able to do the "mount -u /" from the normal /dev, so root
   has to be either devfs or not. If you use "options SLICE", then
   you will get your root device from an internal kernel-only instance
   of devfs.

7/ I have no support for reading or writing 'in-core disklabels' yet.
   fdisk works on the raw device (warning, ANY raw device) and so does
   disklabel using the -r flag. There is no core-dump support yet in
   the new stuff.

Storage Layering:

Here is a brief description of storage layering:

Every device exports a single storage interface. This is called a
'slice'. Each slice is represented by a "struct slice". The struct has
one and only one handler below it (in this case the driver) and one or
zero handlers above it. The slice itself exports a device to the
devfs, so even if a raw disk had no handler above it, it would still
have one raw device available for use (e.g. rdsd0).

If the slice were divided up using fdisk, so that an MBR was
installed, defining some partitions, then the handler above the raw
slice would be the MBR handler. If, however, it were divided up using
the "dangerously dedicated" mode, with a disklabel defining
partitions, then the disklabel handler would be the handler above the
slice.

Each partitioning of the slice by the handler produces more 'slices'.
They export the identical interface that the lower slice does, so that
it might be possible to fdisk an fdisk partition, for example. The
only notable difference between two slices at different layers is
their name:

    sd0->sd0s1->sd0s1a

If we left out the fdisk stage, it would be:

    sd0->sd0a

and if we divided up an fdisk partition using an MBR, we might see:

    sd0->sd0s1->sd0s1s1->sd0s1s1a

(assuming we then disklabeled it).

It is up to any handler to define how many slices it multiplexes to
above and below, but the slices themselves cannot multiplex.
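
The slice rules described above (exactly one handler below, at most
one handler above, and names that grow by a suffix at each layer) can
be sketched in C roughly as follows. This is only an illustration and
is not taken from the patch: the field names, SLICE_NAME_MAX, and the
functions slice_init_raw(), slice_attach() and slice_make_child() are
all hypothetical, and the real "struct slice" certainly carries more
than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SLICE_NAME_MAX 32   /* hypothetical limit, for illustration only */

/* Hypothetical handler descriptor (a driver, MBR, disklabel, ...). */
struct slice_handler {
    const char *h_name;
};

/* Hypothetical, much reduced stand-in for the real struct slice. */
struct slice {
    char name[SLICE_NAME_MAX];          /* e.g. "sd0", "sd0s1", "sd0s1a" */
    const struct slice_handler *lower;  /* exactly one handler below */
    const struct slice_handler *upper;  /* zero or one handler above */
};

/* Set up a raw slice exported directly by a driver. Returns 0 or -1. */
int slice_init_raw(struct slice *s, const char *name,
                   const struct slice_handler *driver)
{
    assert(s != NULL && name != NULL && driver != NULL);
    if (strlen(name) >= sizeof s->name)
        return -1;                      /* name would not fit */
    snprintf(s->name, sizeof s->name, "%s", name);
    s->lower = driver;
    s->upper = NULL;                    /* nothing claims it yet */
    return 0;
}

/* Attach a handler above a slice; a slice may have only one. */
int slice_attach(struct slice *s, const struct slice_handler *h)
{
    if (s->upper != NULL)
        return -1;                      /* already claimed */
    s->upper = h;
    return 0;
}

/*
 * Create a child slice on behalf of the handler sitting above 'parent'.
 * The child's name is the parent's name plus a suffix ("s1", "a", ...),
 * and the handler above the parent becomes the handler below the child.
 */
int slice_make_child(struct slice *child, const struct slice *parent,
                     const char *suffix)
{
    if (parent->upper == NULL)
        return -1;                      /* no handler to subdivide it */
    if (strlen(parent->name) + strlen(suffix) >= sizeof child->name)
        return -1;                      /* combined name would not fit */
    snprintf(child->name, sizeof child->name, "%s%s",
             parent->name, suffix);
    child->lower = parent->upper;
    child->upper = NULL;
    return 0;
}
```

With these helpers, initialising "sd0" under an sd driver, attaching
an MBR handler, making the child "s1", attaching a disklabel handler to
that child and then making the child "a" gives a slice named sd0s1a
whose lower handler is the disklabel handler. A second attach on sd0
fails, which mirrors the "one or zero handlers above" rule.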
Thus what eventually appears is a sandwich of:

    slices          (sd0s1a) (sd0s1b)
                        |       |
    handler            (disklabel)
                            |
    slices              (sd0s1)  (sd0s2)        (vn0a) (vn0b)
                            |       |              |      |
    handler             (     MBR     )           (disklabel)
                                |                      |
    slices                    (sd0)       (wd0)      (vn0)
                                |           |          |
    handler/driver           (sd.c)      (wd.c)     (vn.c)

There would be other layers eventually, e.g. a layer to do bad-block
mapping (cough, I forgot to say I wasn't doing that yet?) or a layer
to do CCD or RAID type things.

The 'slice' structure is well known, in that all handlers know all the
fields and can access them. This provides a 'mailbox' (sic) for
handlers to identify and communicate with each other, without needing
too much knowledge about each other. The handlers supply an array of
methods that they support, either for calls from above or calls from
below. I am still cleaning up the way that handlers invoke each
other's methods, so be kind.. :)

Comments are not only welcome, they are sought!

If you repartition a disk, the entries in /dev should dynamically
change. The present version however maintains consistency by
disallowing opens on lower level devices while higher level devices
are open, so you cannot at this time repartition your root disk while
running on it. (I'm not convinced this is a good thing, but I should
support it.)

julian

I hope I haven't left anything out..

BTW SOS and Luigi.. the patch includes DEVFS fixes for your device
drivers.