From owner-svn-src-head@FreeBSD.ORG  Sun May 27 12:28:45 2012
Return-Path: <owner-svn-src-head@FreeBSD.ORG>
Delivered-To: svn-src-head@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id C1780106566C;
	Sun, 27 May 2012 12:28:45 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from fallbackmx08.syd.optusnet.com.au
	(fallbackmx08.syd.optusnet.com.au [211.29.132.10])
	by mx1.freebsd.org (Postfix) with ESMTP id 1203F8FC14;
	Sun, 27 May 2012 12:28:44 +0000 (UTC)
Received: from mail30.syd.optusnet.com.au (mail30.syd.optusnet.com.au
	[211.29.133.193])
	by fallbackmx08.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q4RCShp6004665; Sun, 27 May 2012 22:28:43 +1000
Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au
	(c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232])
	by mail30.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q4RCSWRE022272
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sun, 27 May 2012 22:28:34 +1000
Date: Sun, 27 May 2012 22:28:32 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: "Kenneth D. Merry" <ken@FreeBSD.org>
In-Reply-To: <201205270611.q4R6B9fO029809@svn.freebsd.org>
Message-ID: <20120527174540.Q1061@besplex.bde.org>
References: <201205270611.q4R6B9fO029809@svn.freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org,
	src-committers@FreeBSD.org
Subject: Re: svn commit: r236138 - head/sys/cam/scsi
X-BeenThere: svn-src-head@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: SVN commit messages for the src tree for head/-current
	<svn-src-head.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-head>,
	<mailto:svn-src-head-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-head>
List-Post: <mailto:svn-src-head@freebsd.org>
List-Help: <mailto:svn-src-head-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-head>,
	<mailto:svn-src-head-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 May 2012 12:28:45 -0000

On Sun, 27 May 2012, Kenneth D. Merry wrote:

> Log:
>  Work around a race condition in devfs by changing the way closes
>  are handled in most CAM peripheral drivers that are not handled by
>  GEOM's disk class.

Sigh.  I was trying to get kib to fix last close properly and remove
a previous D_TRACKCLOSE fix.  The geom case is quite broken too.

>  The usual character driver open and close semantics are that the
>  driver gets N open calls, but only one close, when the last caller
>  closes the device.

vfs+devfs doesn't understand its own state well enough to get this
right.  It's impossible for individual drivers to fix this (except
in a few simple cases where access is basically exclusive), since to
do so they would have to understand vfs+devfs better than vfs+devfs
understands itself, and inspect and maybe modify its entrails...

>  CAM peripheral drivers expect that behavior to be honored to the
>  letter, and the CAM peripheral driver code (specifically
>  cam_periph_release_locked_busses()) panics if it is done incorrectly.

The old way to determine whether the device is already open and whether
it should actually be closed in last-close is to use count_dev().  This
almost worked.  It is still used in the ata tape driver, and I used
it in a couple of other pre-geom and de-geomed ata drivers (mainly to
fix eject in acd).  It is more broken than originally mainly in the
case of multiple devfs mounts, due to lossage in aliasing code.  Its
locking is inadequate, especially when it is called from drivers.  But
devfs has no better way to determine the lastness of a close.  Devfs
close uses count_dev(), and just has better locking for its call.  Even
driver close calls for D_TRACKCLOSE can go missing when count_dev()
is lower than it should be.  But most problems occur when count_dev()
is higher than it should be when devfs close calls it.  Last-close
calls go missing in this case.  Drivers can trust it even less.  For
example, if the driver attempts to track last closes using D_TRACKCLOSE,
then the driver close should do less when this is not the last close.
But if count_dev() is higher than it should be (now at the time of the
finalization of the driver close instead of earlier in devfs close)
then a driver that trusts count_dev() will do less than it should.
OTOH, counting things in drivers takes a lot of work and still cannot
work, since wrong vfs+devfs counts result in closes not matching opens
even with D_TRACKCLOSE.  All a driver can do is:
- count things well enough not to panic went vfs+devfs messes up its
   counts
- use D_TRACKCLOSE to handle missing last closes if necessary.  Most
   drivers don't want the complexity of handling all closes, but some
   are more broken than others by missing last closes, and it may be
   necessary to count all closes to avoid the panics.

>  Since devfs has to drop its locks while it calls a driver's close
>  routine, and it does not have a way to delay or prevent open calls
>  while it is calling the close routine, there is a race.

Indeed.  Note that the race window is unbounded and any number of {
open, i/o, close } sequences may occur (and work!) while the original
last-close is running.  This cannot be fixed in general by holding a
lock throughout the original last-close, since that may want to take
aritrarily long and the lock would deadlock it.  You need to be able
to do { open, ioctl, close } sequences to tell the device not to block.

Dropping the locks is also what makes count_dev() more broken in drivers
than in devfs close:
- originally, count_dev() was spelled vcount() and all vnodes for a device
   were aliased to a single vnode.  This vnode was locked by the kernel
   not being preemptive and/or by SMP not existing and/or by Giant
   locking, even after the vnode exlusive lock was dropped in
   specfs/devfs, unless the driver operation sleeps.  Sleeping gives
   normal races, so for example after a driver close sleeps it should
   have called vcount() to detect state changes while sleeping.  This
   was done by few or no drivers.  But vcount was reliable() after
   waking up, even without the vnode exclusive lock, due to the Giant/etc
   locking.
- the aliasing code regressed to allowing multiple vnodes active on a
   device (with devfs, this takes multiple mounts).  The count is still
   per-device.  There is dev locking to protect the count, but no interlocking
   of this with the vnodes.  This would mostly still work without SMP.  But
   with SMP, you can have 2 active vnodes for a device, and 1 process
   holding the vnode (now non-exclusive) lock for stat() and incrementing
   the device count, while another process tries to last-close the device;
   then in devfs close, although both vnodes are locked the count is 1 too
   high due to the stat() and the last close goes missing; worse, if you
   use D_TRACKCLOSE to try to fix this, you may or may not see the count as
   1 too high and have no way to tell if it is, and no way to fix the locking
   (count_dev() will acquire a dev lock, but for working locking you also
   need the lock for your vnode that devfs close dropped, and also the lock
   for the other aliased vnode(s) that stat() may or may not still hold, since
   count_dev() understands less about this than you and doesn't lock them.
   Then after blocking on these locks, you must consider restarting the
   operation if the state changed while you were blocked...).
- without multiple mounts, there are no aliasing problems, but the ones for
   SMP are similar.  Now count_dev() works as correctly as before in devfs
   close (there are still old problems with revoke(2) and vgone(9), and
   complications for the extra count for controlling terminals), but the
   vnode (exclusive) locking works.  But when the vnode lock is released,
   count_dev() becomes hard to use.  Drivers must add vnode locking around
   sections of code that use it, and the only simplification from the above
   aliased case is that there is only one vnode to know too much about.

kib has some fixes for a small part of this.  They involve not only
incrementing the device count for open and not for stat() and other
operations.
   (Now, most syscalls that lock a device's vnode also increase its
   access count.  This access count is abused as an open count.  This
   abuse used to mostly work everything was exclusively locked in
   devfs close and Giant locked elswhere.  In normal operation,
   devfs close is not called while any other process is in the driver
   (since in normal operation, if for example a read() operation has
   not returned from the driver, the file is still open so close
   will not be called for it).  The device may still be open, with
   other processes in the driver for it, via another file.  Last-close
   semantics (no D_TRACKCLOSE) results in these processes not being
   disturbed for close on the inactive file.  If these processes have
   not yet released the vnode lock in the devfs entry to the driver,
   or have re-aquired it, then the locking prevents devfs close
   disturbing them even with D_TRACKCLOSE.

   Then there is abnormal operation, in which things are quite broken.
   revoke(2) and vgone(9) result in devfs close being called even when
   there other processes in the driver, via the same or another vnode.
   Only locking of the vnode being revoked will prevent this.  If things
   are minimally broken, then last-close will occur.  I think last-closes
   go missing when there are multiple active aliases.  If the driver uses
   D_TRACKCLOSE, then it will see at least 1 close per revoke/vgone.  It
   then has the responsibility of untangling the mess.  revoke() should
   probably apply to all the aliases, but general vgone() should not,
   since it is usually for forcibly unmounting 1 devfs instance -- this
   shouldn't affect others.  D_TRACKCLOSE is most broken in this case.
   I think driver close is only called once per revoke/vgone (the count
   is reduced to 1 so that last-close occurs, but the intermediate
   reductions are not noticed.  kib has some fixes for this.)

>  The sequence of events, simplified a bit, is:
>
>  - devfs acquires a lock
>  - devfs checks the reference count, and if it is 1, continues to close.
>  - devfs releases the lock
>
>  - 2nd process open call on the device happens here
>
>  - devfs calls the driver's close routine

Is this for the 2nd process?  It should be, but see below.

>  - devfs acquires a lock
>  - devfs decrements the reference count
>  - devfs releases the lock
>
>  - 2nd process close call on the device happens here
>
>  At the second close, we get a panic in
>  cam_periph_release_locked_busses(), complaining that peripheral
>  has been released when the reference count is already 0.  This is
>  because we have gotten two closes in a row, which should not
>  happen.

This all seems to be correct operation (normal races) except for the
panic.  The closes are not in a row, but are concurrent, and there can
be any number of them concurrently.  This is an unavoidable consequence
of dropping the lock.  I was more concerned about missing and extra
last closes due to vfs miscounting.  I think D_TRACKCLOSE would only
complicate the above case further.

The second close really shouldn't occur unless it had a corresponding
open.  The sequence might be:

   open
   io
   start close
     start open
     (1)
     end open
     io
     close (should notice that previous one is active and do less)
   end close

Or you could make the middle close tell the first one to return, and
wait for it, and then complete (this in turn might race with another
open/io/close and might pass control to the next close...).

However, there are the following bugs:
- the access count is bogusly increased before driver open is called.  The
   race window for this may be aritrarily long, since driver opens may
   sleep.
- the access count may also be 1 extra for concurrent stat(), etc.
Both of these tend to make last-closes go missing, so I don't understand
your problem yet.  D_TRACKCLOSE increases the problem by fixing the missing
last-closes giving driver close calls when for example there is a 3rd
open/close (first process in close; second process blocked in open;
third process got through open by using O_NONBLOCK).  The extra count for
stat() normally won't result in extra concurrent closes, since stat()
doesn't cause closes.

>  The fix is to add the D_TRACKCLOSE flag to the driver's cdevsw, so
>  that we get a close() call for each open().  That does happen
>  reliably, so we can make sure that our reference counts are
>  correct.

It's horribly unreliable.

>  Note that the sa(4) and pt(4) drivers only allow one context
>  through the open routine.  So these drivers aren't exposed to the
>  same race condition.

When I started looking at this bug suite again this year, it was in
response to someone adding D_TRACKCLOSE to sndstat.  This has at least
partially exclusive access.  But it was still broken.  According to
the commit log close() calls went missing due to other processes being
in open() or stat().  As explained above, the problem with stat() can
only occur when there are multiple devfs mounts, so I doubt that it
was the one that actually occurred.  But it is easy for open()s to
get in each others way, and the exclusive access checking doesn't
help at all since it is not done until after aquiring a lock, so the
following bad behaviour like the following apparently happens:

   open
   start read (aquire lock)
     devfs open (increment access the count that is abused as an open count)
     device open (block on lock, even if O_NOBLOCK)
     dev* open... (can be any number of these of these)
   end read (release lock, but scheduler decides to keep running us)
   devfs close (should be last close, but isn't because wrong open count)
     device open (it can run now.  Missing last close => bogusly busy => EBUSY)
     devfs open (decrement access count)
     device open... (return EBUSY, as above)
     devfs open (eventually access count decremented to 0)
   open (still bogusly busy => EBUSY)
   open... (busy forever)

The exclusive access makes the problem worse by getting stuck in an invalid
state.

The fix was to use D_TRACKCLOSE.  I don't like this, but it is safer for
drivers for drivers with pure exclusive access since the correct open
count is very easy to determine (modulo bugs in revoke()) -- it must be 1
throughout the close.  Bugs in revoke give relatively minor complications:
- missing last closes: avoided by D_TRACKCLOSE
- extra last closes: it's often harmless for the hardware to be turned off
   twice.  If not, use the exclusive access condition to avoid this (no
   owner when close is called means an extra close).
There remains the problem that driver open shouldn't block before doing
the exlusive access check.

>  scsi_enc_internal.h,
>  scsi_pass.c,
>  scsi_sg.c:
>  		For these drivers, change the open() routine to
>  		increment the reference count for every open, and
>  		just decrement the reference count in the close.
>
>  		Call cam_periph_release_locked() in some scenarios
>  		to avoid additional lock and unlock calls.
>
>  scsi_pt.c:	Call cam_periph_release_locked() in some scenarios
>  		to avoid additional lock and unlock calls.

Might work except in the revoke/vgone case.  Note that revoke(2) acts on
any character device (if count_dev() on the device is > 1).  For non-
terminal devices, revoke(2) is a mainly foot-shooting interface so it
could be disallowed.  But vgone() on any cdev must be allowed, since it
is need to forcibly unmount devfs.

Here is one way D_TRACKCLOSE breaks in geom when its foot is shot using
revoke().  geom doesn't count closes, but turns off access flags.  In
old disk code, I used open flags for each device.  Device flags are more
immune to the vfs+devfs counting bugs than device counts, since they stick
at 0 and 1 and you can try to reset them using normal operations.  But
geom breaks anyway.

% static int
% g_dev_close(struct cdev *dev, int flags, int fmt, struct thread *td)
% {
% 	struct g_geom *gp;
% 	struct g_consumer *cp;
% 	int error, r, w, e, i;
% 
% 	gp = dev->si_drv1;
% 	cp = dev->si_drv2;
% 	if (gp == NULL || cp == NULL)
% 		return(ENXIO);
% 	g_trace(G_T_ACCESS, "g_dev_close(%s, %d, %d, %p)",
% 	    gp->name, flags, fmt, td);
% 	r = flags & FREAD ? -1 : 0;
% 	w = flags & FWRITE ? -1 : 0;
% #ifdef notyet
% 	e = flags & O_EXCL ? -1 : 0;
% #else
% 	e = 0;
% #endif

Geom tries to track the open flags.  The above converts them from the flags
arg to r/w/e (e ifdefed off).  But when this is called for revoke/vgone,
the flags are unavailable, or at least not passed down.  I think they are
always 0.  Each open file has its own open mode, so for this to work,
vgone would have to find all the open files and extract their open mode,
then pass this down for each, and when D_TRACKCLOSE is not used, do the
call the driver close precisely when last open file is closed.  But it
now doesn't do much like this.  I think it only calls devfs close once
for all the files, and invents an open mode for that.

% 	g_topology_lock();
% 	if (dev->si_devsw == NULL)
% 		error = ENXIO;		/* We were orphaned */
% 	else
% 		error = g_access(cp, r, w, e);
% 	for (i = 0; i < 10 * hz;) {
% 		if (cp->acr != 0 || cp->acw != 0)
% 			break;
%  		if (cp->nstart == cp->nend)
% 			break;
% 		pause("gdevwclose", hz / 10);
% 		i += hz / 10;
% 	}
% 	if (cp->acr == 0 && cp->acw == 0 && cp->nstart != cp->nend) {
% 		printf("WARNING: Final close of geom_dev(%s) %s %s\n",
% 		    gp->name,
% 		    "still has outstanding I/O after 10 seconds.",
% 		    "Completing close anyway, panic may happen later.");

I remember getting this message.  It is right about the panic happening
soon.  But now I don't see how this code was reached -- null flags would
have resulted in cp->acr and cp->acw remaining unchanged as nonzero.

% 	}
% 	g_topology_unlock();
% 	return (error);
% }

The panic prevents further damage from the access flags becoming scrambled
and the driver close being called.  The case of the device file being mounted
on would be most interesting, but I got the panic using something like
open for read followed by revoke on an unused partition.  To involve the
physical driver, all subdevices on the physical device would have to be
revoked, and the panic normally occurs long before getting that far.

The open flags in the previous scisi code may have had a similar fail-
safeness.  I don't see how you fixed panics by removing them.  Actually
trusting them could have given panics if you checked them, but they
weren't used much.  The new code is actually simpler.  I like this, but
it seems to be simpler than possible.  For example:

% Modified: head/sys/cam/scsi/scsi_sg.c
% ==============================================================================
% --- head/sys/cam/scsi/scsi_sg.c	Sun May 27 05:27:47 2012	(r236137)
% +++ head/sys/cam/scsi/scsi_sg.c	Sun May 27 06:11:09 2012	(r236138)
% [... remove open flag and add D_TRACKCLOSE]
% ...
% @@ -415,19 +414,12 @@ sgopen(struct cdev *dev, int flags, int 
% 
%  	softc = (struct sg_softc *)periph->softc;
%  	if (softc->flags & SG_FLAG_INVALID) {
% +		cam_periph_release_locked(periph);
%  		cam_periph_unlock(periph);
% -		cam_periph_release(periph);
%  		return (ENXIO);
%  	}
% 
% -	if ((softc->flags & SG_FLAG_OPEN) == 0) {
% -		softc->flags |= SG_FLAG_OPEN;
% -		cam_periph_unlock(periph);
% -	} else {
% -		/* Device closes aren't symmetrical, fix up the refcount. */
% -		cam_periph_unlock(periph);
% -		cam_periph_release(periph);

The open flag was only used to avoid increasing the refcount much above 1.
cam_periph_release() keeps count, but periph refcounts never worked any
better than vfs+devfs counts, so when cam_periph_release() naively
trusts them and calls camperiphfree(), panics result.

% -	}
% +	cam_periph_unlock(periph);
% 
%  	return (error);
%  }
% @@ -436,18 +428,11 @@ static int
%  sgclose(struct cdev *dev, int flag, int fmt, struct thread *td)
%  {
%  	struct cam_periph *periph;
% -	struct sg_softc *softc;
% 
%  	periph = (struct cam_periph *)dev->si_drv1;
%  	if (periph == NULL)
%  		return (ENXIO);
% 
% -	cam_periph_lock(periph);
% -
% -	softc = (struct sg_softc *)periph->softc;
% -	softc->flags &= ~SG_FLAG_OPEN;
% -
% -	cam_periph_unlock(periph);
%  	cam_periph_release(periph);
% 
%  	return (0);
%

Now we don't maintain the open flag, but depend on vfs+devfs counts more
directly (via devfs close being called enough).  When these don't work,
panics occur.

However, most of the revoke+vgone bugs involve missing calls to devfs
close.  This results in driver counts of open/close via D_TRACKCLOSE
rarely bcoming 0 to soon or negative.  They gradually grow larger, and
panics don't occur until they overflow to negative or 0.

Another problem that I forgot to mention above closes don't match opens
in the case of failed closes.  This is as it should be.  However,
increments of the "open" count match opens (modulo bogus ones for
stat() etc.), while decremements of the "open" count don't match closes.
The difference is due to bogusly increasing the "open" count before
open completes:

 	increment "open" count
 	call driver open
 	if (driver open failed)
 		decremement "open" count
 		return failure to userland
 	else
 		continue with incremented count
 		eventually call close
 		decrement count
 		call driver close if last-close or D_TRACKCLOSE (fragile state)
 		tis had better not fail
 	endif

Drivers need to know that they must clean up for failing opens, since close
will not be called.  The above shows sgopen() doing this.  It just calls
cam_periph_release() for the failing case.  This decremenents the periph's
refcount corresponding to decrementing the "open" count in the above
pseudo-code.  If the open had touched more hardware, then it should do
more to shut the hardware down.  But only if the device is not otherwise
open or being opened.  A device refcount like the periph one is better than
the "open" count for tracking other device activity.

Bruce