Date:      Mon, 10 Apr 2000 08:24:53 -0700
From:      Julian Elischer <julian@elischer.org>
To:        Poul-Henning Kamp <phk@freebsd.org>
Cc:        arch@freebsd.org
Subject:   Re: BUF/BIO roadmap.
Message-ID:  <38F1F245.2781E494@elischer.org>
References:  <23546.955367727@critter.freebsd.dk>

Poul-Henning Kamp wrote:
> 
> Core asked me to produce a short document about what I am trying to
> do with the struct buf / struct bio and all that jazz.
> 
> This paper can be found here: http://phk.freebsd.dk/Bio/bio.ps
> 
> There are two parts to it:
> 
>   1.  The argumentation for splitting struct bio out from struct buf
> 
>   2.  A road map for the stackable BIO system.

I agree with all that is said...
Some comments: I would like to see these issues addressed:

When I did this I didn't try to separate the buf into two structures,
but rather introduced a structure called an iorequest (ioreq).
This structure was only present in the disk stacking layer, to limit
changes elsewhere in the kernel for stability reasons and to allow
the kernel to still be compiled without the new structure. (This
was a mistake.)

The top-level strategy routine allocated one of these and extracted
the needed fields from the struct buf. The aim was eventually to do
a cleanup of struct buf once ioreq had become ubiquitous in drivers,
and remove all the I/O-related fields.
Different approach, same result.
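A minimal sketch of what I mean, in C; the field and structure names here are illustrative, not the actual ones from my patch or from FreeBSD:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: the I/O-request state split out of struct buf
 * into a small structure that is the only thing the disk stacking
 * layers ever see. */
struct ioreq {
    long    io_blkno;                  /* block on the underlying device */
    size_t  io_bcount;                 /* transfer length in bytes */
    int     io_flags;                  /* read/write etc. */
    void   *io_data;                   /* data buffer */
    void  (*io_done)(struct ioreq *);  /* completion callback */
};

/* struct buf keeps the buffer-cache state and carries the ioreq that
 * the top-level strategy routine fills in and hands down the stack. */
struct buf {
    struct ioreq b_io;
    /* ... cache-management fields would remain here ... */
};

/* Top-level strategy: extract only the I/O fields; the layers below
 * never touch struct buf itself. */
static struct ioreq *
buf_to_ioreq(struct buf *bp)
{
    return (&bp->b_io);
}
```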

From memory, I managed to do this without having both a pblkno and a
lblkno (I noticed that you still have both in the document) in the
bio struct (my ioreq), so I wonder whether both are really needed.
(I may have got the names wrong as the document is not in front of me.)

The devices and stacking had exactly the same semantics as you
suggest (re: refusing to open clashing devices etc.), so I
agree totally with that. You make no mention of how one maps
an arbitrarily stacked set of partitions onto minor numbers.
(E.g. what is the minor number of a partition called da2s1de?)

[scsi-disk(2)]----[MBR(1)]---[BSD(d)]---[BSD(e)]--device_node
(where someone has put a BSD partition within a BSD partition;
that should be legal, right?) I handled this by allocating minors
on the fly and making DEVFS a required item. What is your suggestion?
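To sketch what "allocating minors on the fly" means in practice (all names here are invented for illustration): the minor number encodes nothing about the stack; it is just the next free slot in a table, and DEVFS is what gives the node its meaningful name:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch: a stacked partition node gets whatever minor
 * is next free when it is instantiated; DEVFS publishes the name
 * (e.g. "da2s1de"), and no geometry is encoded in the minor itself. */
#define MAXNODES 64

struct stacked_dev {
    char name[16];   /* devfs name, e.g. "da2s1de" */
    int  minor;      /* opaque index, meaningful only via this table */
};

static struct stacked_dev dev_table[MAXNODES];
static int next_minor;

/* Register a newly probed node and hand back its minor. */
static int
devfs_publish(const char *name)
{
    int m = next_minor++;

    strncpy(dev_table[m].name, name, sizeof(dev_table[m].name) - 1);
    dev_table[m].minor = m;
    return (m);
}
```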

The issue of only physically mapping bufs is not related.

Unless we get BSDI-style interrupt threads, the idea of propagating
'probe' operations up cannot be done safely (believe me, I looked
at this a lot). I even had it running that way. It works, but it's
not guaranteed safe (it never bit me, but statistically it would
eventually bite someone). The solution I eventually came to, but was
never given the opportunity to check in, was to immediately propagate
the 'arrival' events to a separate kernel daemon, called devd, and
queue devd to be scheduled.

The events would be queued for devd's attention. Each event
consisted of information and a function to run. Thus each driver
would schedule one of its functions to be run at kernel
process level (where sleeping on I/O is possible), and that function
would be responsible for initiating the probes for partition types etc.
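Roughly like this (a sketch, not my actual code; the names are made up). The interrupt side does nothing but enqueue, and devd runs the driver's function later at process level:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the devd scheme: interrupt-level code only
 * enqueues an event (information plus a driver function to run); the
 * devd kernel process later dequeues it and runs the function at
 * process level, where sleeping on I/O is allowed. */
struct dev_event {
    void             (*ev_func)(void *); /* driver probe routine */
    void              *ev_arg;           /* event information */
    struct dev_event  *ev_next;
};

static struct dev_event *ev_head;
static struct dev_event **ev_tail = &ev_head;

/* Interrupt side: deliberately tiny so it can be rigorously analysed. */
static void
dev_event_post(struct dev_event *ev)
{
    ev->ev_next = NULL;
    *ev_tail = ev;
    ev_tail = &ev->ev_next;
    /* real code would also wake devd here */
}

/* devd side: drain the queue, running each driver function. */
static void
devd_run(void)
{
    struct dev_event *ev;

    while ((ev = ev_head) != NULL) {
        ev_head = ev->ev_next;
        if (ev_head == NULL)
            ev_tail = &ev_head;
        ev->ev_func(ev->ev_arg);
    }
}

/* Example driver routine that would initiate partition probes. */
static int probes_run;
static void
demo_probe(void *arg)
{
    probes_run += *(int *)arg;
}
```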

The other problem I faced was the possibility that
while a low-level device was open, the user might rewrite the
structures that defined some upper-layer devices. My solution for
that was that on the close() of the lower-level device, all
the upper-level devices were asked to verify that they were
still valid. This was a variant of the probe() call that in some
cases used a lot of the same code. The 'verify' request propagated up
(in the context of the closing process, so devd was not involved),
and on encountering a newly invalidated partition, it was switched into
an 'invalidate' request which was further propagated up to any higher-
layer devices. One result of this was that since all close() operations
on direct devices effectively caused reprobing, all that devd
had to do to probe a device when asked was { open(); close(); }
on the lowest-level device.

As an example of how this worked: closing /dev/da0 would
ask the MBR node to re-examine the MBR. It in turn would pass up
'invalidate' events to any (old) partitions that were no longer the same,
and 'verify' events to any partitions that appeared unchanged
at the MBR level. Obviously the higher-level nodes could not have
been open, or the open of /dev/da0 would not have been possible. With
DEVFS it might actually be possible for opening /dev/da0
to instantaneously invalidate all the higher nodes
(which would remove them from the devfs -- you can't open them
now anyhow), and you would just allow them to be rebuilt from scratch
on the close() anyhow. (This idea may be a bit radical for some.)
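The verify-to-invalidate switch can be sketched as a walk up the stack (again hypothetical names, and a single upward chain rather than a tree, for brevity):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the close()-time walk: a 'verify' request
 * moves up the stack, and the first node whose on-disk metadata no
 * longer matches turns the request into 'invalidate' for everything
 * above it. */
enum upreq { UP_VERIFY, UP_INVALIDATE };

struct node {
    int          valid;   /* does the on-disk metadata still match? */
    int          alive;   /* still attached after the walk? */
    struct node *upper;   /* next node up */
};

static void
propagate_up(struct node *n, enum upreq req)
{
    if (n == NULL)
        return;
    if (req == UP_VERIFY && !n->valid)
        req = UP_INVALIDATE;    /* everything above is now stale */
    if (req == UP_INVALIDATE)
        n->alive = 0;           /* detach the stale node */
    propagate_up(n->upper, req);
}
```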


In the downward propagation of open() and close() calls, we need to
propagate open-for-read() and open-for-read+write() independently.
If one partition is already open for read and another is opened
for read/write, then the 'downstream' device needs to be upgraded to
read/write. However, if the read/write upper device is then closed,
the downstream device should be downgraded to r/o. This cannot be
guaranteed under the present semantics, as only the last close is passed
to the device layer. My preliminary suggestion is the addition of
a method, accessmode(), to the cdevsw entry for a device, that
is called before/after/instead of the open() and close() calls IF
THE DRIVER SUPPORTS IT, and that fully propagates this information.

Justin Gibbs suggested that this call should also let the driver
know WHO is making the call, and that it should also be called
when a fork() or dup() call is made, so that
the driver can keep an accurate accounting of which modes are presently
in use and which are not. I am including the whole stacking framework
under the name "driver" here. This needs further discussion,
and I think there may be better solutions.
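The bookkeeping that accessmode() would make possible is simple reference counting per mode; a sketch (the mode flags and function names here are invented for illustration):

```c
#include <assert.h>

/* Hypothetical sketch: count read and read/write opens of the upper
 * devices, and keep the downstream device held open with the union of
 * the modes still in use, so it is downgraded to r/o when the last
 * writer closes. */
#define MODE_READ   0x1
#define MODE_WRITE  0x2

struct stack_node {
    int readers;   /* opens that include read */
    int writers;   /* opens that include write */
};

static void
accessmode_open(struct stack_node *n, int mode)
{
    if (mode & MODE_READ)  n->readers++;
    if (mode & MODE_WRITE) n->writers++;
}

static void
accessmode_close(struct stack_node *n, int mode)
{
    if (mode & MODE_READ)  n->readers--;
    if (mode & MODE_WRITE) n->writers--;
}

/* Mode the downstream device must currently be held open with. */
static int
downstream_mode(struct stack_node *n)
{
    int m = 0;

    if (n->readers || n->writers) m |= MODE_READ;
    if (n->writers)               m |= MODE_WRITE;
    return (m);
}
```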

Some upward-propagating events, such as revocation, may be safe at
interrupt level; however, I think a general mechanism such
as the one I implemented with devd can be a win in the long run, as it
can be proven to be safe, with the only point of danger being
limited to the code that queues the action request, which
can be kept small enough to be rigorously analysed.

As I have said before, it's a pity that NIH removed my code,
but having this pretty much identical code added is certainly
the right thing, even if it is two years later than
it would have been. (Had to say it, you know...)

> 
> Barring any competent objections, the patch still up for review
> at http://phk.freebsd.dk/misc will be committed and work progress
> according to the roadmap.

I haven't looked again, but I assume this is still the patch I
agreed to before...

> 
> --
> Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
> phk@FreeBSD.ORG         | TCP/IP since RFC 956
> FreeBSD coreteam member | BSD since 4.3-tahoe
> Never attribute to malice what can adequately be explained by incompetence.
> 
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-arch" in the body of the message

-- 
      __--_|\  Julian Elischer
     /       \ julian@elischer.org
    (   OZ    ) World tour 2000
---> X_.---._/  presently in:  Perth
            v



