Date:      Tue, 23 May 2006 07:10:17 GMT
From:      Bruce Evans <bde@zeta.org.au>
To:        freebsd-bugs@FreeBSD.org
Subject:   Re: kern/97665: hang in sio driver
Message-ID:  <200605230710.k4N7AHQK017139@freefall.freebsd.org>

The following reply was made to PR kern/97665; it has been noted by GNATS.

From: Bruce Evans <bde@zeta.org.au>
To: Miles Nordin <carton@Ivy.NET>
Cc: FreeBSD-gnats-submit@freebsd.org, freebsd-bugs@freebsd.org
Subject: Re: kern/97665: hang in sio driver
Date: Tue, 23 May 2006 17:05:10 +1000 (EST)

 On Mon, 22 May 2006, Miles Nordin wrote:
 
 >> Description:
 >> How-To-Repeat:
 > freebsd's serial driver seems to hang up a lot.  processes get stuck
 > in uninterruptible sleep (don't respond to 'kill -9'), and i can
 > release them by, say, power-cycling a modem.
 >
 > try this:
 >
 > first, get a serial device that holds CTS low.
                ^ external (?)
 >
 > # stty crtscts < /dev/ttyd0.lock
 
 Don't normally do this.  It doesn't seem to be necessary to demonstrate
 the bug here, and it tends to break programs that actually understand
 crtscts.  E.g., locking crtscts on for a serial mouse stops the X11
 mouse driver from turning crtscts off, and then the port hangs in a way
 related to this PR.
 
 > # stty crtscts < /dev/ttyd0.init
 
 This is necessary to demonstrate the bug here, since support for setting
 crtscts in the program was lost when cu was broken by replacing Taylor cu
 (part of uucp) by native cu (part of tip) many years ago.  I forget if
 Taylor cu clobbered the system (initial) default crtscts setting.
 
 > # cu -l ttyd0 -s 9600
 
 Another thing lost in cu is defaulting to using the system default
 line speed.  cu now defaults to its hard-coded default speed of 9600.
 This can be fixed by locking the system default speed.
 
 > Connected.
 > asdfasd;ljk;bouns;douahf
 > ~.
 > [EOT]
 > ^T
 > load: 0.00  cmd: cu 42104 [ttywai] 0.00u 0.03s 0% 752k
 >
 > now, open another window, and try 'kill -9 42104'.  doesn't work.
 
 Did you connect it to an external device that holds CTS low (or just
 to nothing) so that it blocks waiting for CTS, and then on "~." it
 blocks in exit waiting for the output to drain?  In this setup, the
 write() happens to complete since it is small: it goes to driver
 buffers, and write() somewhat bogusly returns success although the
 output hasn't actually all gone out (with CTS blocking it, _none_
 has gone out).  This makes it possible for cu to read the "~." and
 clean up and exit.  Like most programs, cu has low-quality cleanup
 before exit.  If it actually cared about output going out, it would
 use one of the following methods:
 A: for serial devices, use tcdrain().  Maybe use a timeout and tcflush()
     to avoid endless waits.
 B: for general devices, close() the device and actually check the return
     status.  This provides less control.  For tty devices, close()
     essentially does tcdrain() in the kernel, but interrupting this
     doesn't work quite right (it causes the output to be flushed and
     no error to be returned by close()).
 Programs that don't care about their output going out use the following
 method:
 C: just exit().  This pushes the close() to the kernel.  It has all the
     disadvantages of method (B), plus the following:
     C1: there is no way to act on the result of close().
     C2: there is no way to send a signal to an exiting process, or at
         least no way for one to affect close().
 
 (C) and (C2) cause the symptoms reported in the above part of this PR.
 (C2) is the only bug here.  It has nothing to do with sio or even tty
 drivers generally.  All tty drivers are required to wait for output
 to drain in close(), and this is handled in the tty layer (function
 ttywait()).
 
 To work around bug (C2) and also endless waits in tcdrain(), use the
 drainwait ioctl (e.g., comcontrol /dev/ttyd0 drainwait <n>).  This
 defaults to 180 seconds, so most hangs on [ttywai] aren't actually
 endless (they just seem to be).  I normally use the default, but change
 to 1 second when something hangs (which happens fairly often since I
 have lots of unconnected ports with crtscts initially on and sometimes
 forget which ports are connected).  Note that setting a too-small
 drainwait affects tcdrain() and thus may break normal output.  tcdrain()
 seems to do the right thing (it returns an error after the timeout,
 and presumably doesn't flush the output), but programs wouldn't expect
 tcdrain() to fail due to a timeout.
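
 For programs that want to adjust the timeout themselves, a hedged
 sketch follows.  It assumes that TIOCSDRAINWAIT, from FreeBSD's
 <sys/ttycom.h>, is the ioctl behind "comcontrol ... drainwait" and that
 it takes a pointer to an int holding seconds.  The wrapper name
 set_drainwait() is hypothetical, and the call may require superuser
 privilege.

```c
#include <errno.h>
#include <sys/ioctl.h>

/*
 * Sketch: set the output drain timeout, in seconds, for the tty open on fd.
 * TIOCSDRAINWAIT is assumed to be FreeBSD-specific; where it is not
 * defined this function fails with ENOTSUP instead of doing anything.
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_drainwait(int fd, int seconds)
{
#ifdef TIOCSDRAINWAIT
    return ioctl(fd, TIOCSDRAINWAIT, &seconds);
#else
    (void)fd;
    (void)seconds;
    errno = ENOTSUP;
    return -1;
#endif
}
```

 A caller that lowers the timeout this way should be prepared for
 tcdrain() to fail once the timeout expires, as noted above.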
 
 > now for real fun try this:
 >
 > # ls -l /dev/cuad0
 >
 > provided you type that command for the first time while the
 > serial port is hung, you will hang devfs which will pretty soon hang
 > the whole goddamned machine.  once cuad0 node is instantiated, that
 > vulnerability no longer exists.
 
 Devfs has lots of bugs, but I don't work on versions of FreeBSD that
 have it and can only guess its bug here.  Apparently it does something
 like an open() on /dev/cuad0.  open() has side effects for all devices
 so ls shouldn't go anywhere near it.  The side effects are particularly
 large for serial ports.  open()s of cuad0 must block waiting for ttyd0
 to go away.  Even reopens of ttyd0 may take arbitrarily long -- they
 wait for DTR to be held off long enough, and you can make the wait
 arbitrarily long using the dtrwait ioctl (comcontrol /dev/ttyd0 dtrwait <n>).
 The default for dtrwait is 3 seconds so waiting for it is normally not
 as noticeable as waiting for drainwait.
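
 For contrast, listing a device node normally needs only its metadata.
 The sketch below (the helper name is hypothetical) classifies a path
 with lstat(), which does not call the driver's open routine.  Whether
 devfs in the affected versions does more than this during lookup is
 the open question here, so this shows only what ls is expected to need.

```c
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Sketch: report whether path names a character device without opening it.
 * lstat() reads only the node's metadata, so the open() side effects
 * described above (blocking, DTR changes) are not triggered by this call.
 * Returns 1 for a character device, 0 for anything else, -1 on error.
 */
int is_char_device(const char *path)
{
    struct stat st;

    if (lstat(path, &st) == -1)
        return -1;
    return S_ISCHR(st.st_mode) ? 1 : 0;
}
```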
 
 > after some very long timeout on the order of minutes, the system may
 > recover itself.
 
 Always after 3 minutes?
 
 > Fix:
 > IMHO, a process should always respond to 'kill -9' no matter _what_ SIO
 > is doing, waiting for carrier, with data in the output buffer waiting
 > for CTS to assert itself, whatever, period.  I shouldn't have the process
 > table cluttered with anything that can be removed only by changing the logic
 > state of some serial port pin.  serial is not a SCSI port---it's highly
 > public.
 
 It's no more public.  Both are controlled by device permissions and
 accesses to control ports are not normally granted to everyone.  Users
 of serial ports can only set crtscts if crtscts is not locked off, and
 the sysadmin can lock it off for hardware that doesn't support it.
 The sysadmin can also set drainwait to limit the effects of this
 bug when it is permitted to occur.
 
 > definitely sio activity should not be able to hang devfs.
 
 A system-wide hang is much more serious.
 
 Other bugs near this area:
 - the vfs layer doesn't count devices sleeping in open() properly, so the
    count of activity on tty devices vs the cua devices can get messed up
    and it isn't possible to open devices in one of these classes until
    all the others reach close().  This was fixed in FreeBSD-1, but the fix
    was lost in FreeBSD-2.
 - it is possible for open()/close() to (double)cross close().  This is
    necessary for reducing drainwait to be possible (since an open() is
    required to do an ioctl()), but it causes races that aren't handled
    properly (ones which seem to be harmful in theory but harmless in
    practice).  sioopen() is or was very careful about concurrent opens,
    but sioclose() isn't so careful.  It basically assumes that concurrent
    closes aren't possible (last-close semantics are supposed to prevent
    concurrent closes).  However, when close() sleeps, as it often does
    for draining serial devices, the following can occur:
    1. close() sleeps in thread 1
    2. open() completes in thread 2.  The state changes for this may mess
       up the state for thread 1 (but I think there are only problems with
       the hardware state).
    3. ioctl() and even i/o in thread 2.  Input might even work, but output
       would block for the same reason as thread 1, unless thread 1 races
       thread 2 and completes while thread 2 is active (then the completion
        may change the state of the hardware and break i/o in thread 2).
    4. close() completes in thread 2.  It doesn't normally block, but I'm not
       sure how it manages this since the tty struct is common.  State
       changes in this should prevent the output in thread 1 from completing
       in the normal way.  Note that the device close() is only reachable due
       to the poor open/close counting in the vfs layer.  Last-close semantics
       is supposed to prevent multiple devices in close(), but the counts are
       decremented before calling close(); thus there can be any number of
       threads sleeping in close() and 1 thread at a time clobbering the
       state for the sleeping threads.
    The fix would involve backing out of close() if the device was reopened
    while we were sleeping in close() and is still open, and not completing
    close() if another thread is already sleeping in close() (i.e., change
    some last-closes into non-last ones and vice versa, and count things
    better so that this is possible).
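
 The counting rules in that fix can be illustrated with a toy,
 single-threaded model.  This is not sio or vfs code; the struct and
 function names are hypothetical, and a real implementation would need
 locking around every state change.

```c
#include <stdbool.h>

/* Toy model of one serial device; all names are hypothetical. */
struct port {
    int  opens;       /* descriptors currently open on the device */
    int  closers;     /* threads sleeping in the driver close() */
    bool hw_enabled;  /* stands in for DTR, interrupts, etc. */
};

void port_open(struct port *p)
{
    p->opens++;
    p->hw_enabled = true;
}

/*
 * Called when a descriptor is released.  Returns true if the caller is
 * the last closer and must now sleep waiting for output to drain.
 * If another thread is already sleeping in close(), this close is
 * treated as a non-last close and must not touch the hardware.
 */
bool port_close_begin(struct port *p)
{
    p->opens--;
    if (p->opens > 0 || p->closers > 0)
        return false;
    p->closers++;
    return true;
}

/*
 * Called after the drain sleep ends.  Returns true if the hardware was
 * shut down.  If the device was reopened during the sleep and is still
 * open, back out and leave the hardware state alone.
 */
bool port_close_finish(struct port *p)
{
    p->closers--;
    if (p->opens > 0)
        return false;
    p->hw_enabled = false;
    return true;
}
```

 In the sequence listed above, thread 1 calls port_close_begin() and
 sleeps, thread 2 opens and then closes; thread 2's close returns false
 and leaves the state alone, and thread 1 finishes the real last close.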
 
 Bruce


