Date:      Tue, 7 Dec 2004 00:00:55 GMT
From:      Bruce Evans <bde@zeta.org.au>
To:        freebsd-bugs@FreeBSD.org
Subject:   Re: misc/74786: Smartlink Modem causes interrupt storm on RELENG_4 and RELENG_5
Message-ID:  <200412070000.iB700tQI025604@freefall.freebsd.org>

The following reply was made to PR kern/74786; it has been noted by GNATS.

From: Bruce Evans <bde@zeta.org.au>
To: Mike Tancsa <mike@sentex.net>
Cc: freebsd-gnats-submit@freebsd.org, freebsd-bugs@freebsd.org
Subject: Re: misc/74786: Smartlink Modem causes interrupt storm on RELENG_4
 and RELENG_5
Date: Tue, 7 Dec 2004 10:55:40 +1100 (EST)

 On Mon, 6 Dec 2004, Mike Tancsa wrote:
 
 > >Description:
 > I think we have been bouncing around this issue for the past few months both on RELENG_4 and RELENG_5.  In the past it has been somewhat difficult to reproduce, but now we can do it reliably.    I don't think it's a hardware issue as I can take the exact same 2 boxes with the exact same IRQ assignments and boot with OpenBSD and not run into an interrupt storm or freeze up the box.  Swap back the RELENG_4 or RELENG_5 HD and again, I can produce an interrupt storm at will.
 >
 > I can also reproduce it on 2 different chipsets as well (VIA and Intel).  The problem seems to involve a PUC device (either a PCI modem or a PCI serial card) sharing an interrupt with another device (usually a USB controller).
 >
 > On RELENG_4, the box just locks up in a race trying to service an interrupt on IRQ 12 but remains unhandled.
 
 This is because interrupt storms are fatal in RELENG_4 (if they happen).
 
 > On RELENG_5, I actually catch an interrupt storm. e.g. I attach to sio4 (PUC modem) and
 >
 > Interrupt storm detected on "irq12: uhci1"; throttling interrupt source
 >
 > Looking at vmstat -i does indeed show the rate getting throttled
 >
 > releng-5-pioneer# vmstat -i
 > interrupt                          total       rate
 > irq0: clk                         596719         99
 > irq1: atkbd0                           2          0
 > irq4: sio0                          1079          0
 > irq6: fdc0                             1          0
 > irq8: rtc                         763812        127
 > irq12: uhci1                        5825          0
 
 This seems to be from a machine without the problem.  There is no sign
 of a storm here, and no sign of a puc or sio device sharing irq12.
 
 > irq13: npx0                            1          0
 > irq14: ata0                        38727          6
 > irq15: vr0 ata1                     1984          0
 
 The shared case should look like this.  The irq "name" string is too short
 to show more than 1 or 2 devices but I think it would show 2 devices OK
 like it does here.
 
 > Total                            1408150        235
 > releng-5-pioneer#
 >
 > where irq12 is the IRQ shared by the modem and the USB port.  However, because all IRQ 12s get throttled, the modem is unusable. e.g. trying to cu -l /dev/cuaa4 and typing atz takes about 5 seconds.
 
 Does a storm occur when both devices are successfully attached?  Hmm, the
 above is consistent with the following combination:
 1. only usb being attached
 2. the sio device still driving the interrupt but sio not being called to
    handle the interrupt
 3. a very old version of 5.x that has interrupt storm handling with only 1
    interrupt handler call per second for the storming interrupt (later
    versions have HZ interrupts/second)
 4. the old but current rounding bug in vmstat which results in interrupt
    rates of 1-epsilon being displayed as 0 and the clock interrupt rates
    of 100-epsilon and 128-epsilon being displayed as 99 and 127.  The
    above shows a usb rate of approx. 5825/(596719/100) = 0.976.  systat
    would show correctly rounded values.  My version of vmstat has a
    compile-time option to display more precision.
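
The arithmetic in point 4 can be sketched in C. This is only an
illustration, assuming the displayed rate is an integer division of
the total by the uptime in seconds; it is not taken from the vmstat
source. The function names are hypothetical, and the uptime of 5968
seconds used in the note below is an assumed value chosen to be
consistent with the figures in the quoted output.

```c
#include <stdint.h>

/* Rate as plain integer division produces it: always rounds down. */
static uint64_t
rate_truncated(uint64_t total, uint64_t uptime_secs)
{
	if (uptime_secs == 0)
		return (0);
	return (total / uptime_secs);
}

/* Rate rounded to the nearest integer instead of truncated. */
static uint64_t
rate_rounded(uint64_t total, uint64_t uptime_secs)
{
	if (uptime_secs == 0)
		return (0);
	return ((total + uptime_secs / 2) / uptime_secs);
}
```

With the assumed uptime of 5968 seconds, rate_truncated() gives 99,
127 and 0 for the clk, rtc and uhci1 totals (596719, 763812 and 5825),
matching the quoted display, while rate_rounded() gives 100, 128 and 1.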
 
 Point (2) shouldn't happen if the device is probed and the probe can find
 the device's irq.  Then the probe turns off the irq.  This is unlikely
 to be a problem for pci devices, but for isa devices it is essential
 to tell the kernel about all devices since interrupts can easily be left
 active after a warm boot from another or the same OS that was using the
 devices.  Such interrupts are deactivated as a side effect of the probe.
 
 > The problem is that the modem is not being seen as a PCI / PUC device and instead is being seen as an ISA SIO device ??  The following RELENG_5 and RELENG_4 patches seem to fix the problem.  I wonder if the other modems listed in sio.c suffer the same fate ?
 
 The primary bug is that bus_setup_intr() still doesn't support dynamic
 choice between fast and normal interrupt handling modes.  All devices
 sharing an irq must use the same mode.  Normal mode must be used unless
 all the relevant drivers support fast mode.  The mode can't be decided
 correctly at attach time or reasonably by drivers at all since the
 full set of drivers is not known at attach time (except for the last
 device, if any).
 
 sio just tries for fast mode first.  If this succeeds then it breaks
 all other devices on the irq that want normal mode.  Minimal breakage
 is for the other devices to not be available.  If their probe or attach
 is buggy or not done then they may cause interrupt storms by driving
 the interrupt.  Whether the try for fast mode succeeds in the shared
 case is too dependent on attach order and upper layers.
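
The attach-order dependence can be shown with a toy model in C. This
is a sketch only, not kernel code: struct irq_line, irq_attach() and
attach_fast_then_normal() are hypothetical names, and the model
assumes nothing beyond the rule stated above that all handlers on one
irq must use the same mode.

```c
/* Mode currently in force on a (toy) interrupt line. */
enum irq_mode { IRQ_UNUSED, IRQ_NORMAL, IRQ_FAST };

struct irq_line {
	enum irq_mode mode;	/* mode fixed by the first handler */
	int nhandlers;		/* handlers successfully attached */
};

/* Attach a handler; fails (-1) if the line already uses another mode. */
static int
irq_attach(struct irq_line *irq, enum irq_mode want)
{
	if (irq->mode != IRQ_UNUSED && irq->mode != want)
		return (-1);
	irq->mode = want;
	irq->nhandlers++;
	return (0);
}

/* sio-like policy: try fast mode first, then fall back to normal. */
static int
attach_fast_then_normal(struct irq_line *irq)
{
	if (irq_attach(irq, IRQ_FAST) == 0)
		return (0);
	return (irq_attach(irq, IRQ_NORMAL));
}
```

If the sio-like driver attaches first on an unused line, it gets fast
mode and a later irq_attach(&line, IRQ_NORMAL) from the other driver
fails.  If the normal-mode driver attaches first, the fast attempt
fails, the fallback succeeds, and both handlers share the line.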
 
 Using puc combined with not using PUC_FASTINTR "works" by breaking any
 possibility of using fast mode for puc sio devices.  It makes sio's
 try for fast mode always fail for pci devices.  CY_PCI_FASTINTR does
 the same thing for pci cy devices.  The default is to fail safe (try
 for normal mode only) but to try for fast mode first if *_FASTINTR is
 configured.  The pci layer of sio could implement a similar hack, but
 the layering is not set up for this to be easy, and drivers shouldn't
 have special hacks for this.  The isa layer should only try for fast
 mode since isa irqs can't be shared without special support.
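
The policy described in this paragraph can be summarized as a small
decision function.  This is a hedged sketch with hypothetical names
(modes_to_try(), bus_kind, intr_mode); the real drivers select this
with compile-time options such as PUC_FASTINTR rather than a runtime
flag, which is simplified here to an int argument.

```c
enum bus_kind { BUS_ISA, BUS_PCI };
enum intr_mode { MODE_NORMAL, MODE_FAST };

/*
 * Fill order[] with the interrupt modes to try, in order, and return
 * how many entries were filled.
 */
static int
modes_to_try(enum bus_kind bus, int fastintr_configured,
    enum intr_mode order[2])
{
	if (bus == BUS_ISA) {
		/* isa irqs are not shared, so only fast mode is tried. */
		order[0] = MODE_FAST;
		return (1);
	}
	if (fastintr_configured) {
		/* Opt-in: try fast first, then fall back to normal. */
		order[0] = MODE_FAST;
		order[1] = MODE_NORMAL;
		return (2);
	}
	/* Fail-safe default for shared pci irqs: normal mode only. */
	order[0] = MODE_NORMAL;
	return (1);
}
```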
 
 There is only a small relevant difference between PCI and ISA sio
 devices.  It is supposed to be possible to configure sio devices to
 use no irq at all (then they use polled mode and won't cause interrupt
 conflicts) by omitting their irq from the configuration.  Unfortunately,
 configuration became too smart starting with PCI.  The BIOS may support
 moving or not using irqs for PCI devices, but FreeBSD doesn't.
 
 The quickest fix is to change sioattach() to only try for normal mode.
 Normal mode would work better in -current than in RELENG_4 (good enough
 for most configurations, since the latency bugs are reduced), except
 sioattach bogusly asks for non-MPSAFE mode which greatly increases the
 latency bugs relative to RELENG_4.  Fix (?):
 
 %%%
 Index: sio.c
 ===================================================================
 RCS file: /home/ncvs/src/sys/dev/sio/sio.c,v
 retrieving revision 1.442
 diff -u -2 -r1.442 sio.c
 --- sio.c	25 Jun 2004 10:51:33 -0000	1.442
 +++ sio.c	26 Jun 2004 23:11:13 -0000
 @@ -1173,5 +1315,6 @@
  		if (ret) {
  			ret = BUS_SETUP_INTR(device_get_parent(dev), dev,
 -					     com->irqres, INTR_TYPE_TTY,
 +					     com->irqres,
 +					     INTR_TYPE_CLK | INTR_MPSAFE,
  					     siointr, com, &com->cookie);
  			if (ret == 0)
 %%%
 
 This also hacks around the use of the low priority level INTR_TYPE_TTY
 when a very high priority level is preferred.  Using INTR_TYPE_CLK is a
 hack.  A level even higher than that of clocks is preferred.
 
 There may be some brokenness involving layers here.  I thought that
 the above worked, but it shouldn't for puc devices because puc still
 uses INTR_TYPE_TTY and doesn't use INTR_MPSAFE.  It seems to be hard
 for puc to use INTR_TYPE_FASTTTY and INTR_MPSAFE even if all subdevices
 support them.  Whether all subdevices support them is attach ordering
 dependent in the same way as for INTR_FAST.
 
 
 > # cat /var/run/dmesg.boot
 > Copyright (c) 1992-2004 The FreeBSD Project.
 > Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
 >         The Regents of the University of California. All rights reserved.
 > FreeBSD 4.10-STABLE #6: Fri Nov 26 13:52:22 EST 2004
 >     mdtancsa@station.sentex.ca:/usr/obj/usr/src/sys/gas
 
 Is this with the fixed version?  The dmesg for the interrupt storms version
 would be more informative.
 
 > ...
 > puc0: <SmartLink 5634PCV SurfRider> port 0xa400-0xa407 irq 12 at device 5.0 on pci1
 > sio2: type 16550A
 > ...
 > ichsmb0: <Intel 82801EB (ICH5) SMBus controller> port 0x5000-0x501f irq 12 at device 31.3 on pci0
 > smbus0: <System Management Bus> on ichsmb0
 
 The conflict seems to be with ichsmb, not with usb.  I don't know whether
 either ichsmb or usb turns off interrupts properly if its attach is
 not reached or fails.
 
 > >How-To-Repeat:
 > boot a box with a smartlink PCI modem and have it share its interrupt with a usb controller.
 > >Fix:
 >
 > # diff -u sys/isa/sio.c.prev sys/isa/sio.c
 > --- sys/isa/sio.c.prev  Thu Sep  9 20:54:24 2004
 > +++ sys/isa/sio.c       Thu Sep  9 20:54:38 2004
 > @@ -602,7 +602,6 @@
 >         { 0x048011c1, "Lucent kermit based PCI Modem", 0x14 },
 >         { 0x95211415, "Oxford Semiconductor PCI Dual Port Serial", 0x10 },
 >         { 0x7101135e, "SeaLevel Ultra 530.PCI Single Port Serial", 0x18 },
 > -       { 0x0000151f, "SmartLink 5634PCV SurfRider", 0x10 },
 >         { 0x98459710, "Netmos Nm9845 PCI Bridge with Dual UART", 0x10 },
 >         { 0x00000000, NULL, 0 }
 >  };
 
 Removing pci support for one of the few pci sio devices that doesn't need
 puc is not good.
 
 > --- sys/dev/puc/pucdata.c.prev  Thu Sep  9 21:01:30 2004
 > +++ sys/dev/puc/pucdata.c       Thu Sep  9 21:02:48 2004
 > @@ -804,6 +804,15 @@
 >             },
 >         },
 >
 > +        {   "SmartLink 5634PCV SurfRider",
 > +            {   0x151f, 0x0000, 0,      0       },
 > +            {   0xffff, 0xffff, 0,      0       },
 > +            {
 > +                { PUC_PORT_TYPE_COM, 0x10, 0x00, COM_FREQ },
 > +            },
 > +        },
 > +
 > +
 >         /* Actiontec  56K PCI Master */
 >         {   "Actiontec 56K PCI Master",
 >             {   0x11c1, 0x0480, 0x0,    0x0     },
 
 ISTR that the pci support must be removed for this to work.  If so,
 then there is another large ordering bug: things apparently break
 because the pci attach happens to be first.  pci first is probably
 right except for the interrupt mode race since it needs fewer layers
 at runtime, but the order is undocumented AFAIK.
 
 Bruce


