From owner-svn-src-head@FreeBSD.ORG Sun May 27 12:28:45 2012 Return-Path: Delivered-To: svn-src-head@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id C1780106566C; Sun, 27 May 2012 12:28:45 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from fallbackmx08.syd.optusnet.com.au (fallbackmx08.syd.optusnet.com.au [211.29.132.10]) by mx1.freebsd.org (Postfix) with ESMTP id 1203F8FC14; Sun, 27 May 2012 12:28:44 +0000 (UTC) Received: from mail30.syd.optusnet.com.au (mail30.syd.optusnet.com.au [211.29.133.193]) by fallbackmx08.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q4RCShp6004665; Sun, 27 May 2012 22:28:43 +1000 Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail30.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q4RCSWRE022272 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sun, 27 May 2012 22:28:34 +1000 Date: Sun, 27 May 2012 22:28:32 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: "Kenneth D. Merry" In-Reply-To: <201205270611.q4R6B9fO029809@svn.freebsd.org> Message-ID: <20120527174540.Q1061@besplex.bde.org> References: <201205270611.q4R6B9fO029809@svn.freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org Subject: Re: svn commit: r236138 - head/sys/cam/scsi X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 May 2012 12:28:45 -0000 On Sun, 27 May 2012, Kenneth D. Merry wrote: > Log: > Work around a race condition in devfs by changing the way closes > are handled in most CAM peripheral drivers that are not handled by > GEOM's disk class. Sigh. I was trying to get kib to fix last close properly and remove a previous D_TRACKCLOSE fix. The geom case is quite broken too. > The usual character driver open and close semantics are that the > driver gets N open calls, but only one close, when the last caller > closes the device. vfs+devfs doesn't understand its own state well enough to get this right. It's impossible for individual drivers to fix this (except in a few simple cases where access is basically exclusive), since to do so they would have to understand vfs+devfs better than vfs+devfs understands itself, and inspect and maybe modify its entrails... > CAM peripheral drivers expect that behavior to be honored to the > letter, and the CAM peripheral driver code (specifically > cam_periph_release_locked_busses()) panics if it is done incorrectly. The old way to determine whether the device is already open and whether it should actually be closed in last-close is to use count_dev(). This almost worked. It is still used in the ata tape driver, and I used it in a couple of other pre-geom and de-geomed ata drivers (mainly to fix eject in acd). It is more broken than originally mainly in the case of multiple devfs mounts, due to lossage in aliasing code. Its locking is inadequate, especially when it is called from drivers. But devfs has no better way to determine the lastness of a close. Devfs close uses count_dev(), and just has better locking for its call. Even driver close calls for D_TRACKCLOSE can go missing when count_dev() is lower than it should be. But most problems occur when count_dev() is higher than it should be when devfs close calls it. Last-close calls go missing in this case. Drivers can trust it even less. For example, if the driver attempts to track last closes using D_TRACKCLOSE, then the driver close should do less when this is not the last close. But if count_dev() is higher than it should be (now at the time of the finalization of the driver close instead of earlier in devfs close) then a driver that trusts count_dev() will do less than it should. OTOH, counting things in drivers takes a lot of work and still cannot work, since wrong vfs+devfs counts result in closes not matching opens even with D_TRACKCLOSE. All a driver can do is: - count things well enough not to panic went vfs+devfs messes up its counts - use D_TRACKCLOSE to handle missing last closes if necessary. Most drivers don't want the complexity of handling all closes, but some are more broken than others by missing last closes, and it may be necessary to count all closes to avoid the panics. > Since devfs has to drop its locks while it calls a driver's close > routine, and it does not have a way to delay or prevent open calls > while it is calling the close routine, there is a race. Indeed. Note that the race window is unbounded and any number of { open, i/o, close } sequences may occur (and work!) while the original last-close is running. This cannot be fixed in general by holding a lock throughout the original last-close, since that may want to take aritrarily long and the lock would deadlock it. You need to be able to do { open, ioctl, close } sequences to tell the device not to block. Dropping the locks is also what makes count_dev() more broken in drivers than in devfs close: - originally, count_dev() was spelled vcount() and all vnodes for a device were aliased to a single vnode. This vnode was locked by the kernel not being preemptive and/or by SMP not existing and/or by Giant locking, even after the vnode exlusive lock was dropped in specfs/devfs, unless the driver operation sleeps. Sleeping gives normal races, so for example after a driver close sleeps it should have called vcount() to detect state changes while sleeping. This was done by few or no drivers. But vcount was reliable() after waking up, even without the vnode exclusive lock, due to the Giant/etc locking. - the aliasing code regressed to allowing multiple vnodes active on a device (with devfs, this takes multiple mounts). The count is still per-device. There is dev locking to protect the count, but no interlocking of this with the vnodes. This would mostly still work without SMP. But with SMP, you can have 2 active vnodes for a device, and 1 process holding the vnode (now non-exclusive) lock for stat() and incrementing the device count, while another process tries to last-close the device; then in devfs close, although both vnodes are locked the count is 1 too high due to the stat() and the last close goes missing; worse, if you use D_TRACKCLOSE to try to fix this, you may or may not see the count as 1 too high and have no way to tell if it is, and no way to fix the locking (count_dev() will acquire a dev lock, but for working locking you also need the lock for your vnode that devfs close dropped, and also the lock for the other aliased vnode(s) that stat() may or may not still hold, since count_dev() understands less about this than you and doesn't lock them. Then after blocking on these locks, you must consider restarting the operation if the state changed while you were blocked...). - without multiple mounts, there are no aliasing problems, but the ones for SMP are similar. Now count_dev() works as correctly as before in devfs close (there are still old problems with revoke(2) and vgone(9), and complications for the extra count for controlling terminals), but the vnode (exclusive) locking works. But when the vnode lock is released, count_dev() becomes hard to use. Drivers must add vnode locking around sections of code that use it, and the only simplification from the above aliased case is that there is only one vnode to know too much about. kib has some fixes for a small part of this. They involve not only incrementing the device count for open and not for stat() and other operations. (Now, most syscalls that lock a device's vnode also increase its access count. This access count is abused as an open count. This abuse used to mostly work everything was exclusively locked in devfs close and Giant locked elswhere. In normal operation, devfs close is not called while any other process is in the driver (since in normal operation, if for example a read() operation has not returned from the driver, the file is still open so close will not be called for it). The device may still be open, with other processes in the driver for it, via another file. Last-close semantics (no D_TRACKCLOSE) results in these processes not being disturbed for close on the inactive file. If these processes have not yet released the vnode lock in the devfs entry to the driver, or have re-aquired it, then the locking prevents devfs close disturbing them even with D_TRACKCLOSE. Then there is abnormal operation, in which things are quite broken. revoke(2) and vgone(9) result in devfs close being called even when there other processes in the driver, via the same or another vnode. Only locking of the vnode being revoked will prevent this. If things are minimally broken, then last-close will occur. I think last-closes go missing when there are multiple active aliases. If the driver uses D_TRACKCLOSE, then it will see at least 1 close per revoke/vgone. It then has the responsibility of untangling the mess. revoke() should probably apply to all the aliases, but general vgone() should not, since it is usually for forcibly unmounting 1 devfs instance -- this shouldn't affect others. D_TRACKCLOSE is most broken in this case. I think driver close is only called once per revoke/vgone (the count is reduced to 1 so that last-close occurs, but the intermediate reductions are not noticed. kib has some fixes for this.) > The sequence of events, simplified a bit, is: > > - devfs acquires a lock > - devfs checks the reference count, and if it is 1, continues to close. > - devfs releases the lock > > - 2nd process open call on the device happens here > > - devfs calls the driver's close routine Is this for the 2nd process? It should be, but see below. > - devfs acquires a lock > - devfs decrements the reference count > - devfs releases the lock > > - 2nd process close call on the device happens here > > At the second close, we get a panic in > cam_periph_release_locked_busses(), complaining that peripheral > has been released when the reference count is already 0. This is > because we have gotten two closes in a row, which should not > happen. This all seems to be correct operation (normal races) except for the panic. The closes are not in a row, but are concurrent, and there can be any number of them concurrently. This is an unavoidable consequence of dropping the lock. I was more concerned about missing and extra last closes due to vfs miscounting. I think D_TRACKCLOSE would only complicate the above case further. The second close really shouldn't occur unless it had a corresponding open. The sequence might be: open io start close start open (1) end open io close (should notice that previous one is active and do less) end close Or you could make the middle close tell the first one to return, and wait for it, and then complete (this in turn might race with another open/io/close and might pass control to the next close...). However, there are the following bugs: - the access count is bogusly increased before driver open is called. The race window for this may be aritrarily long, since driver opens may sleep. - the access count may also be 1 extra for concurrent stat(), etc. Both of these tend to make last-closes go missing, so I don't understand your problem yet. D_TRACKCLOSE increases the problem by fixing the missing last-closes giving driver close calls when for example there is a 3rd open/close (first process in close; second process blocked in open; third process got through open by using O_NONBLOCK). The extra count for stat() normally won't result in extra concurrent closes, since stat() doesn't cause closes. > The fix is to add the D_TRACKCLOSE flag to the driver's cdevsw, so > that we get a close() call for each open(). That does happen > reliably, so we can make sure that our reference counts are > correct. It's horribly unreliable. > Note that the sa(4) and pt(4) drivers only allow one context > through the open routine. So these drivers aren't exposed to the > same race condition. When I started looking at this bug suite again this year, it was in response to someone adding D_TRACKCLOSE to sndstat. This has at least partially exclusive access. But it was still broken. According to the commit log close() calls went missing due to other processes being in open() or stat(). As explained above, the problem with stat() can only occur when there are multiple devfs mounts, so I doubt that it was the one that actually occurred. But it is easy for open()s to get in each others way, and the exclusive access checking doesn't help at all since it is not done until after aquiring a lock, so the following bad behaviour like the following apparently happens: open start read (aquire lock) devfs open (increment access the count that is abused as an open count) device open (block on lock, even if O_NOBLOCK) dev* open... (can be any number of these of these) end read (release lock, but scheduler decides to keep running us) devfs close (should be last close, but isn't because wrong open count) device open (it can run now. Missing last close => bogusly busy => EBUSY) devfs open (decrement access count) device open... (return EBUSY, as above) devfs open (eventually access count decremented to 0) open (still bogusly busy => EBUSY) open... (busy forever) The exclusive access makes the problem worse by getting stuck in an invalid state. The fix was to use D_TRACKCLOSE. I don't like this, but it is safer for drivers for drivers with pure exclusive access since the correct open count is very easy to determine (modulo bugs in revoke()) -- it must be 1 throughout the close. Bugs in revoke give relatively minor complications: - missing last closes: avoided by D_TRACKCLOSE - extra last closes: it's often harmless for the hardware to be turned off twice. If not, use the exclusive access condition to avoid this (no owner when close is called means an extra close). There remains the problem that driver open shouldn't block before doing the exlusive access check. > scsi_enc_internal.h, > scsi_pass.c, > scsi_sg.c: > For these drivers, change the open() routine to > increment the reference count for every open, and > just decrement the reference count in the close. > > Call cam_periph_release_locked() in some scenarios > to avoid additional lock and unlock calls. > > scsi_pt.c: Call cam_periph_release_locked() in some scenarios > to avoid additional lock and unlock calls. Might work except in the revoke/vgone case. Note that revoke(2) acts on any character device (if count_dev() on the device is > 1). For non- terminal devices, revoke(2) is a mainly foot-shooting interface so it could be disallowed. But vgone() on any cdev must be allowed, since it is need to forcibly unmount devfs. Here is one way D_TRACKCLOSE breaks in geom when its foot is shot using revoke(). geom doesn't count closes, but turns off access flags. In old disk code, I used open flags for each device. Device flags are more immune to the vfs+devfs counting bugs than device counts, since they stick at 0 and 1 and you can try to reset them using normal operations. But geom breaks anyway. % static int % g_dev_close(struct cdev *dev, int flags, int fmt, struct thread *td) % { % struct g_geom *gp; % struct g_consumer *cp; % int error, r, w, e, i; % % gp = dev->si_drv1; % cp = dev->si_drv2; % if (gp == NULL || cp == NULL) % return(ENXIO); % g_trace(G_T_ACCESS, "g_dev_close(%s, %d, %d, %p)", % gp->name, flags, fmt, td); % r = flags & FREAD ? -1 : 0; % w = flags & FWRITE ? -1 : 0; % #ifdef notyet % e = flags & O_EXCL ? -1 : 0; % #else % e = 0; % #endif Geom tries to track the open flags. The above converts them from the flags arg to r/w/e (e ifdefed off). But when this is called for revoke/vgone, the flags are unavailable, or at least not passed down. I think they are always 0. Each open file has its own open mode, so for this to work, vgone would have to find all the open files and extract their open mode, then pass this down for each, and when D_TRACKCLOSE is not used, do the call the driver close precisely when last open file is closed. But it now doesn't do much like this. I think it only calls devfs close once for all the files, and invents an open mode for that. % g_topology_lock(); % if (dev->si_devsw == NULL) % error = ENXIO; /* We were orphaned */ % else % error = g_access(cp, r, w, e); % for (i = 0; i < 10 * hz;) { % if (cp->acr != 0 || cp->acw != 0) % break; % if (cp->nstart == cp->nend) % break; % pause("gdevwclose", hz / 10); % i += hz / 10; % } % if (cp->acr == 0 && cp->acw == 0 && cp->nstart != cp->nend) { % printf("WARNING: Final close of geom_dev(%s) %s %s\n", % gp->name, % "still has outstanding I/O after 10 seconds.", % "Completing close anyway, panic may happen later."); I remember getting this message. It is right about the panic happening soon. But now I don't see how this code was reached -- null flags would have resulted in cp->acr and cp->acw remaining unchanged as nonzero. % } % g_topology_unlock(); % return (error); % } The panic prevents further damage from the access flags becoming scrambled and the driver close being called. The case of the device file being mounted on would be most interesting, but I got the panic using something like open for read followed by revoke on an unused partition. To involve the physical driver, all subdevices on the physical device would have to be revoked, and the panic normally occurs long before getting that far. The open flags in the previous scisi code may have had a similar fail- safeness. I don't see how you fixed panics by removing them. Actually trusting them could have given panics if you checked them, but they weren't used much. The new code is actually simpler. I like this, but it seems to be simpler than possible. For example: % Modified: head/sys/cam/scsi/scsi_sg.c % ============================================================================== % --- head/sys/cam/scsi/scsi_sg.c Sun May 27 05:27:47 2012 (r236137) % +++ head/sys/cam/scsi/scsi_sg.c Sun May 27 06:11:09 2012 (r236138) % [... remove open flag and add D_TRACKCLOSE] % ... % @@ -415,19 +414,12 @@ sgopen(struct cdev *dev, int flags, int % % softc = (struct sg_softc *)periph->softc; % if (softc->flags & SG_FLAG_INVALID) { % + cam_periph_release_locked(periph); % cam_periph_unlock(periph); % - cam_periph_release(periph); % return (ENXIO); % } % % - if ((softc->flags & SG_FLAG_OPEN) == 0) { % - softc->flags |= SG_FLAG_OPEN; % - cam_periph_unlock(periph); % - } else { % - /* Device closes aren't symmetrical, fix up the refcount. */ % - cam_periph_unlock(periph); % - cam_periph_release(periph); The open flag was only used to avoid increasing the refcount much above 1. cam_periph_release() keeps count, but periph refcounts never worked any better than vfs+devfs counts, so when cam_periph_release() naively trusts them and calls camperiphfree(), panics result. % - } % + cam_periph_unlock(periph); % % return (error); % } % @@ -436,18 +428,11 @@ static int % sgclose(struct cdev *dev, int flag, int fmt, struct thread *td) % { % struct cam_periph *periph; % - struct sg_softc *softc; % % periph = (struct cam_periph *)dev->si_drv1; % if (periph == NULL) % return (ENXIO); % % - cam_periph_lock(periph); % - % - softc = (struct sg_softc *)periph->softc; % - softc->flags &= ~SG_FLAG_OPEN; % - % - cam_periph_unlock(periph); % cam_periph_release(periph); % % return (0); % Now we don't maintain the open flag, but depend on vfs+devfs counts more directly (via devfs close being called enough). When these don't work, panics occur. However, most of the revoke+vgone bugs involve missing calls to devfs close. This results in driver counts of open/close via D_TRACKCLOSE rarely bcoming 0 to soon or negative. They gradually grow larger, and panics don't occur until they overflow to negative or 0. Another problem that I forgot to mention above closes don't match opens in the case of failed closes. This is as it should be. However, increments of the "open" count match opens (modulo bogus ones for stat() etc.), while decremements of the "open" count don't match closes. The difference is due to bogusly increasing the "open" count before open completes: increment "open" count call driver open if (driver open failed) decremement "open" count return failure to userland else continue with incremented count eventually call close decrement count call driver close if last-close or D_TRACKCLOSE (fragile state) tis had better not fail endif Drivers need to know that they must clean up for failing opens, since close will not be called. The above shows sgopen() doing this. It just calls cam_periph_release() for the failing case. This decremenents the periph's refcount corresponding to decrementing the "open" count in the above pseudo-code. If the open had touched more hardware, then it should do more to shut the hardware down. But only if the device is not otherwise open or being opened. A device refcount like the periph one is better than the "open" count for tracking other device activity. Bruce