Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 9 Nov 2003 02:30:16 -0800 (PST)
From:      Don Lewis <truckman@FreeBSD.org>
To:        bde@zeta.org.au
Cc:        current@FreeBSD.org
Subject:   Re: serial console oddity
Message-ID:  <200311091030.hA9AUGeF063359@gw.catspoiler.org>
In-Reply-To: <20031109192046.L2537@gamplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On  9 Nov, Bruce Evans wrote:
> On Sat, 8 Nov 2003, Don Lewis wrote:
> 
>> I've been seeing some wierd things for many months when using a serial
>> console on my -CURRENT box.  I finally had a chance to take a closer
>> look today.
>>
>> It looks like the problem is some sort of interference between kernel
>> output to the console and userland writes to /dev/console.  I typically
>> see syslogd output to the console get corrupted.  Each message that
>> syslogd writes seems to get truncated or otherwise corrupted.  The most
>> common thing I see is that each syslog message is reduced to a space and
>> the first character of the month, or sometimes just a space, or
>> sometimes nothing at all.
> 
> This is (at least primarily) a longstanding bug in ttymsg().  It uses
> nonblocking mode so that it doesn't block in write() or close().  For
> the same reason, it doesn't wait for output to drain before close().
> If the close happens to be the last one on the device, this causes any
> data buffered in the tty and lower software layers to be discarded
> cleanly and any data in lower hardware layers to by discarded in a
> driver plus hardware-dependent way (usually not so cleanly, especially
> for the character being transmitted).

I didn't think of a flush on close problem because I thought syslogd
always kept the console open.

>> This is totally consistent until I "kill
>> -HUP" syslogd, which I believe causes syslogd to close and open
>> /dev/console, after which the syslog output appears correct on the
>> console. When the syslogd output is being corrupted, I can cat a file to
>> /dev/console and the output appears to be correct.
> 
> When I debugged this, syslogd didn't seem to keep the console open,
> so the open()/close() in ttymsg() always caused the problem.  I didn't
> notice killing syslogd makes a difference.  Perhaps it helps due to a
> missing close.  Holding the console open may be a workaround or even
> the correct fix.  It's not clear where this should be done (should all
> clients of ttymsg() do it?).  Running getty on the console or on the
> underlying tty device should do it accidentally.

It looks to me like syslogd keeps the console open in addition to the
open()/close() in ttymsg().  cfline() calls open() on anything that
begins with '/' and calls isatty() to figure out whether it should set
the type to F_CONSOLE, F_TTY, or F_FILE, and init() closes the file
descriptor for all of these when syslogd is HUPed.

I wonder if the console descriptor is getting revoked ...

>> I truss'ed syslogd, and it appears to be working normally, the writev()
>> call that writes the data to the console appears to be writing the
>> correct character count, so it would appear that the fault is in the
>> kernel.
> 
> If there are any kernel bugs in this area, then they would be that
> last close of the console affects the underlying tty.  The multiple
> console changes are quite likely to have broken this if getty is run
> on the underlying tty (they silently discarded the half-close of the
> underlying tty which was needed to avoided trashing some of its state
> when only the console is closed).

I'm not running getty on my serial console.  It is running on ttyv*. I'm
only using the serial console to capture kernel stack traces, etc.

>> The problem doesn't appear to be specific to syslogd, because I have
>> seen the output from the shutdown scripts that goes to the console get
>> truncated as well.
> 
> Yes, in theory it should affect anything that uses ttymsg() or does
> direct non-blocking writes without waiting for the output to drain.

> Here are some half-baked fixes.  The part that clears O_NONBLOCK is
> wrong, and the usleep() part is obviously a hack.  ttymsg() shouldn't
> block even in close(), since if the close is in the parent ttymsg()
> might block forever and if the close() is in a forked child then
> blocking could create zillions of blocked children.
> 
> Another part of the patch is concerned with limiting forked children.
> If I were happy with that part then blocking would not be so bad.  In
> practice, I don't have enough system activity for blocked children to
> be a problem.  To see the problem with blocked children, do something
> like the following:
> - turn off clocal on the console so that the console can block better.
>   For sio consoles this often requires turning it off in the lock-state
>   device, since the driver defends against this foot shooting by locking
>   it on.
> - hold the console open or otherwise avoid the original bug in this
>   thread, else messages will just be discarded in close() faster than
>   they can pile up.
> - turn off your external console device or otherwise drop carrier.
> - send lots of messages.

I ran into this years ago when I was using actual terminals for console
devices and someone typed a ^S and went away.  The failure mode wasn't
pretty.  It was probably in the days when I was managing Masscomp boxes,
but it might have been the first batch of Sun servers.

> 
> For a non-half-baked fix, do somethng like:
> - never block in ttymsg(), but always wait for output to drain using
>   tcdrain() in a single child process.  It's probably acceptable for
>   this to not report errors to ttymsg()'s caller.
> - limit children better.  I think we now fork children iff writev()
>   returns EWOULDBLOCK and this happens mainly when the tty buffers
>   fill up due to clocal being off and the external console not
>   listening.  Handling this right seems to require handing off the
>   messages to a single child process that can buffer the messages
>   in userland and can block writing and draining them.  Blocked write()s
>   and tcdrain()s are easy enough to handle in a specialized process by
>   sending signals to abort them.

I like this idea better.  The userland process could be smart to drop
entire lines of output if things started getting too backlogged.



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200311091030.hA9AUGeF063359>