Date:      Sun, 14 Jun 1998 11:00:01 -0700 (PDT)
From:      Matthew Dillon <dillon@backplane.com>
To:        freebsd-bugs@FreeBSD.ORG
Subject:   Re: i386/6944: bug in i386/isa/icu_ipl.s - AST gets lost, causes extreme network slowdown when cpu-bound processes present, possibly other problems
Message-ID:  <199806141800.LAA11833@freefall.freebsd.org>

The following reply was made to PR i386/6944; it has been noted by GNATS.

From: Matthew Dillon <dillon@backplane.com>
To: Bruce Evans <bde@zeta.org.au>
Cc: FreeBSD-gnats-submit@FreeBSD.ORG
Subject: Re: i386/6944: bug in i386/isa/icu_ipl.s - AST gets lost, causes extreme network slowdown when cpu-bound processes present, possibly other problems
Date: Sun, 14 Jun 1998 10:54:42 -0700 (PDT)

 :>	    cmpl	$SWI_AST,%ecx
 :>     	    je		splz_nextx		/* "can't happen" */
 :>
 :>	Actually can happen.  I'm not exactly sure how it happens, but the
 :>	result is that that AST gets cleared from ipending without being run.
 :
 :It "can't happen" because SWI_AST_MASK is "always" set in `cpl' until
 :the kernel is about to return to user mode.  Something must be clearing
 :SWI_AST_MASK in `cpl' or in the cpl to be "restored".  The typo spl(0)
 :instead of spl0() would do this.  Please look for whatever does it.
 :This may be as simple as looking at the stack trace to see spl(0) and
 :verifying that SWI_AST_MASK is set (you can't trust the latter since
 :ddb doesn't mask interrupts).
 :
 :Bruce
 
     Well, I spent 6 hours, from 9 p.m. to 3 a.m., just to find this :-)  I'm
     going to leave finding the broken spl to someone else, but there ARE
     several places where $0 is loaded into the cpl in the assembly, and
     other places where the interrupt nesting count is manually reset to 1.
     I'm not sure it's necessary to 'reset' the cpl state at all: the standard
     interrupt context push/pop ought to do that inherently, so if things are
     being left dangling there's definitely something wrong elsewhere in the
     code that these manual resets are 'covering up'.  It could be anywhere.
     The spl0()/splz() stuff is a mess and should probably be removed entirely.
 
     The problem is extremely reproducible: just NFS mount / and /usr from
     a server to a workstation, run a for (;;); process on the server, and
     try to run xterm on the workstation and, poof.
 
     When I did this, vmstat showed the number of context switches never
     exceeded 100/sec.  Hmm... suspicious!  Without ./x (the for (;;); process)
     running, the number of context switches went to 600+/sec for two seconds
     to load xterm via NFS.  With ./x running, the number of context switches
     was around 50/sec, running xterm on the client increased it to only
     100/sec, and xterm took forever to load via NFS.
 
     With the fix and ./x running, xterm took only 2 seconds to load via
     NFS and was completely unaffected by the existence of the cpu-bound
     task.
 
     -
 
     I'd suggest changing the assembly to do a sanity check of the cpl rather
     than simply save/restore it around an SWI (or normal interrupt for that
     matter)... if the cpl isn't in the state it was in before the call
     to the handler, printf() a warning.
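
     The sanity check suggested above could look roughly like the following
     C model.  This is only a sketch under stated assumptions, not kernel
     code: the cpl variable here is a plain static standing in for the real
     one, and call_handler_checked(), well_behaved_handler() and
     leaky_handler() are hypothetical names invented for illustration.

```c
#include <stdio.h>

/* Toy stand-in for the kernel's cpl variable; NOT the real kernel state. */
static unsigned cpl;

/* Hypothetical handlers, for illustration only. */
static void well_behaved_handler(void) { /* leaves cpl untouched */ }
static void leaky_handler(void) { cpl = 0; /* clears cpl, never restores it */ }

/*
 * Call a handler, then verify that cpl is what it was before the call.
 * Returns 1 (and prints a warning) if the handler left cpl changed,
 * 0 otherwise.
 */
static int call_handler_checked(void (*handler)(void), const char *name)
{
    unsigned saved = cpl;

    handler();
    if (cpl != saved) {
        printf("warning: %s returned with cpl=%#x, expected %#x\n",
               name, cpl, saved);
        cpl = saved;    /* still restore it, as a plain save/restore would */
        return 1;
    }
    return 0;
}
```

     The point of the design is that the restore still happens, so behaviour
     is unchanged, but a handler that leaves the cpl dangling gets reported
     instead of being silently covered up.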
 
     I also noticed that the fast interrupt code doesn't save/restore the cpl
     around the call to the interrupt handler, but the 'normal' interrupt 
     code does.  I believe the code thinks this is ok because it's leaving
     the cpu CLI'd through the call, but I actually think the slow interrupt
     handler results in faster operation because the interrupt context doesn't
     get popped & repushed through a ring change if a nested interrupt occurs.
     I also submit that the fast interrupt code doesn't make the system any 
     more responsive...  the two critical time-sensitive interrupts are the 
     ethernet rx and the serial rx and neither is able to keep up as it
     stands... our 100BaseTX boards almost universally get knocked back into
     store and forward mode due to rx overruns after the machine's been up 
     for a while, and anyone with a digital camera can tell you that the 
     serial interrupt sucks rocks in terms of being able to process exceptions 
     at a high rate in unhandshaked mode without overrunning.  Running a 
     'fast' interrupt with interrupts disabled isn't a hot idea when the 'fast'
     interrupt isn't the serial or network receive interrupt!
 
     But whatever the case, the core assembly shouldn't be gratuitously
     clearing the AST from ipending if it doesn't intend to run the AST
     trap :-)
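
     To make the failure mode concrete, here is a toy C model of a
     splz-style loop.  It is a sketch only: it is not the code in
     icu_ipl.s and not the patch submitted with the PR.  The bit number
     used for SWI_AST is a placeholder, and toy_splz(), lowest_bit() and
     handlers_run are hypothetical names.  With drop_ast set, an AST that
     is pending while SWI_AST_MASK is clear in cpl gets cleared from
     ipending without being run, which is the behaviour described at the
     top of this message; with drop_ast clear, the bit is left pending.

```c
#include <stdio.h>

#define SWI_AST       31u               /* placeholder bit number (assumed) */
#define SWI_AST_MASK  (1u << SWI_AST)

static unsigned cpl;        /* bits set = interrupt classes currently masked */
static unsigned ipending;   /* bits set = interrupts waiting to be run */
static int handlers_run;    /* how many pending entries were "run" */

/* Index of the lowest set bit; the caller guarantees m != 0. */
static unsigned lowest_bit(unsigned m)
{
    unsigned i = 0;

    while ((m & 1u) == 0) {
        m >>= 1;
        i++;
    }
    return i;
}

/* Run everything that is pending and not masked by cpl. */
static void toy_splz(int drop_ast)
{
    unsigned ready;

    while ((ready = ipending & ~cpl) != 0) {
        unsigned bit = lowest_bit(ready);

        if (bit == SWI_AST) {
            if (drop_ast) {
                ipending &= ~SWI_AST_MASK;  /* cleared but never run */
                continue;
            }
            break;      /* leave the AST pending for whoever handles it */
        }
        ipending &= ~(1u << bit);
        handlers_run++;                     /* stands in for the handler */
    }
}
```

     For example, starting from cpl = 0 and ipending = SWI_AST_MASK | (1u << 3),
     toy_splz(1) returns with ipending == 0 and the AST lost, while
     toy_splz(0) returns with only SWI_AST_MASK still set.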
 
 						-Matt
 
     Matthew Dillon   Engineering, BEST Internet Communications, Inc.
 		     <dillon@backplane.com>
     [always include a portion of the original email in any response!]
 



