From owner-freebsd-current@FreeBSD.ORG  Wed Dec  7 05:40:36 2005
In-Reply-To: <43961758.4020407@elischer.org>
References: <43961758.4020407@elischer.org>
Mime-Version: 1.0 (Apple Message framework v746.2)
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
Message-Id: <1B4F46C2-C424-45F8-9328-BEE2AA6E0DC6@FreeBSD.org>
Content-Transfer-Encoding: 7bit
From: John Baldwin <jhb@FreeBSD.org>
Date: Wed, 7 Dec 2005 00:40:19 -0500
To: Julian Elischer <julian@elischer.org>
Cc: current@FreeBSD.org
Subject: Re: can someone explain...[ PCI interrupts] 


On Dec 6, 2005, at 5:57 PM, Julian Elischer wrote:

> In short words for the likes of me,
> Can someone give a quick roundup on PCI routing in 4.x and -current.

My, what a set of questions. :)  I'll do my best, but this will  
probably be a long and perhaps wandering e-mail.

First off, interrupts for PCI devices are roughly split up into two  
categories (currently): INTx interrupt lines and MSI interrupts.  MSI  
is relatively new and I won't cover it much here.  No versions of  
FreeBSD currently support MSI either (though it's on my todo list),  
so I'll limit this discussion to INTx interrupts.  For INTx  
interrupts, each PCI device (or slot) has 4 interrupt lines: INTA,  
INTB, INTC, and INTD.  Thus, you can describe any individual PCI  
interrupt as a tuple of (bus, slot, pin).  For example, device 4's  
INTA pin on pci bus 0 would be (0, 4, INTA).  Each PCI function is  
allowed to have one INTx interrupt.  The bus and slot come from the  
location of that function in the PCI hierarchy, and the pin comes  
from the intpin PCI config register.  PCI doesn't define beyond the  
INTx pin how an interrupt is delivered to the CPU, etc.  That is all  
a property of the architecture, chipset, etc.
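
As a small illustration of the tuple described above (a sketch only,
not FreeBSD code; the function and type names are made up), the
intpin config register at offset 0x3D holds 0 for "no interrupt" and
1-4 for INTA-INTD, so building the (bus, slot, pin) tuple looks
roughly like this:

```python
from typing import NamedTuple, Optional

# Encoding of the PCI "interrupt pin" config register (offset 0x3D).
# 0 means the function does not use an INTx interrupt at all.
INTPIN_NAMES = {1: "INTA", 2: "INTB", 3: "INTC", 4: "INTD"}


class PciIntr(NamedTuple):
    """Hypothetical container for the (bus, slot, pin) tuple."""
    bus: int
    slot: int
    pin: str


def pci_intr_tuple(bus: int, slot: int, intpin_reg: int) -> Optional[PciIntr]:
    """Build the (bus, slot, pin) tuple for one PCI function.

    bus and slot come from the function's location in the PCI
    hierarchy; intpin_reg is the raw value of its intpin register.
    """
    if intpin_reg == 0:
        return None  # this function has no INTx interrupt
    if intpin_reg not in INTPIN_NAMES:
        raise ValueError("invalid intpin register value: %d" % intpin_reg)
    return PciIntr(bus, slot, INTPIN_NAMES[intpin_reg])
```

For example, pci_intr_tuple(0, 4, 1) gives the (0, 4, INTA) tuple
used in the text.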

On x86, there are two disparate sets of hardware for managing  
interrupt signals.  The first is the pair of 8259A interrupt  
controllers found on all PC-AT compatible machines.   The second set  
of hardware is the APIC subsystem as it were.  Each processor  
contains a local APIC that can receive messages from other APICs and  
send messages to other local APICs.  In addition to the local APICs,  
the chipset contains 1 or more I/O APICs.  Each I/O APIC contains  
anywhere from 4 to 32 individual interrupt pins.  Common numbers are  
4 (somewhat rare), 16, 24, and 32.  Conceptually, on x86 a given  
interrupt source can be described by the tuple (pic, pin).

Simply put, PCI interrupt routing is the mapping of (bus, slot, pin)  
PCI interrupt tuples to (pic, pin) x86 interrupt tuples.

Now, before delving deeper into the specifics of routing on x86, let  
me digress about IRQs on FreeBSD.  Basically, an IRQ value is a  
cookie useful for binding a device interrupt (such as a PCI (bus,  
slot, pin) tuple or an ISA IRQ) to an x86 interrupt tuple (pic, pin).
BIOSes don't operate with APICs at all, at least not for handling  
device interrupts.  Thus, they all use a simple mapping where IRQs  
0-7 correspond to pins 0-7 on the master 8259A, and IRQs 8-15 map to  
pins 0-7 on the slave 8259A.  All versions of FreeBSD use the same  
mapping for IRQ cookie values when using the 8259As to route  
interrupts.  For the APIC case the mapping of IRQ cookies to (pic,  
pin) tuples is slightly more complicated.  First, the simple case.   
FreeBSD 5.2 and later follow the ACPI model (even when not using  
ACPI) where the IRQs 0-n correspond to the pins 0-n of the first I/O  
APIC, IRQs n+1 to (n+1)+m map to pins 0 to m of the second I/O APIC,  
etc.  (There is one possible exception with ACPI I'll cover later.)   
FreeBSD 4.x is more complicated.  The reason is that due to cpl and  
spl interrupt masks being 32-bit integers with 8 bits set aside for  
software interrupts (SWIs), cpl only has 24 bits available for  
hardware interrupts.  Therefore, FreeBSD before 5.2 is limited to IRQ
values 0-23 and can't use the simple (and intuitive) model that  
FreeBSD 5.2+ and ACPI use.  What FreeBSD 4.x does is to map the ISA  
interrupts attached to the first I/O APIC to IRQs 0, 1, and 3-15.   
This just leaves IRQs 2 and 16-23 available for all the other APIC  
interrupt pins.  As each PCI device registers an interrupt handler  
for a specific (apic, pin) tuple, that x86 interrupt is mapped to one  
of the last set of IRQs.  If all of them have been used already, then  
the kernel starts assigning multiple (apic, pin) tuples to the same  
IRQ resulting in interrupts being shared in software because of the  
cpl limitation even though they aren't shared in hardware.  This is  
why your IRQ values are different on 4.x than on FreeBSD 5.2+ and  
Linux which use the ACPI global interrupt number model.
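
The three IRQ numbering schemes above can be sketched as follows.
This is only a model for illustration, not kernel code: all names are
invented, and in particular the order in which the 4.x-style
allocator hands out IRQs 2 and 16-23, and the way it wraps around
once they are used up, are assumptions (the text only says that
tuples start sharing IRQs at that point).

```python
def irq_8259a(pic, pin):
    """8259A case: master pins 0-7 are IRQs 0-7, slave pins 0-7 are IRQs 8-15."""
    if pic not in ("master", "slave") or not 0 <= pin <= 7:
        raise ValueError("bad 8259A (pic, pin): %r" % ((pic, pin),))
    return pin if pic == "master" else 8 + pin


def global_irq(apic_pin_counts, apic_index, pin):
    """5.2+/ACPI model: number the pins consecutively across I/O APICs.

    apic_pin_counts[i] is the number of pins on the i-th I/O APIC.
    """
    if not 0 <= pin < apic_pin_counts[apic_index]:
        raise ValueError("pin out of range for this I/O APIC")
    return sum(apic_pin_counts[:apic_index]) + pin


class Fbsd4IrqAllocator:
    """4.x-style model: ISA interrupts hold IRQs 0, 1 and 3-15, so every
    other (apic, pin) tuple has to fit into IRQs 2 and 16-23."""

    # ASSUMPTION: hand-out order; the text does not specify one.
    FREE_IRQS = [2] + list(range(16, 24))

    def __init__(self):
        self.assigned = {}  # (apic, pin) -> IRQ cookie

    def irq_for(self, apic, pin):
        key = (apic, pin)
        if key not in self.assigned:
            n = len(self.assigned)
            # ASSUMPTION: once the pool is exhausted, wrap around, so
            # several (apic, pin) tuples share one IRQ in software.
            self.assigned[key] = self.FREE_IRQS[n % len(self.FREE_IRQS)]
        return self.assigned[key]
```

With two 24-pin I/O APICs, global_irq([24, 24], 1, 3) is IRQ 27,
which is the kind of value that 5.2+ reports and that a 4.x kernel
cannot represent.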

Now, back to how routing of PCI device interrupts on x86 actually  
works.  I'll cover non-ACPI first.  There are two cases to consider.   
First, the easy case is that a PCI device interrupt (bus, slot, pin)
is wired directly to an individual pin on a pic.  This is often how  
interrupts are wired when using APICs.  If you look at the mptable  
output and look at the interrupt section, this is fairly obvious as  
you will see entries that map the interrupt for a given pci bus, slot  
and pin to a given apic id and intpin on that apic.  Thus, there is  
the mapping for (bus, slot, pin) to (pic, pin) directly.  The way  
interrupt routing is implemented in this case is that when we go to  
route an interrupt for a given PCI device, we search the mptable for  
a matching entry.  We then look up the associated apic via its apic  
id, ask it for the specified pin, and then ask that pin for its IRQ  
(via the pic_vector method of the ioapic interrupt source object that  
describes the specific pin).  When nexus(4) does bus_setup_intr(), it  
passes that IRQ to the x86 intr_machdep code which uses the IRQ as an  
index into its interrupt source array and ends up with the interrupt  
source object for the (apic, pin) tuple being used.  (Thus, IRQs are  
just a cookie that is the index into the global array of interrupt  
sources on x86.)  Note that interrupts routed this way are hardwired  
into the motherboard design.  There's no chance for the OS to change  
which (pic, pin) a PCI device interrupt is hooked up to.
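
A rough sketch of that MP Table lookup, assuming the 5.2+ numbering
where each I/O APIC owns a consecutive block of IRQs.  The entry
layout, the sample table, and the apic_base_irq map are all made up
for illustration; the real code goes through the ioapic interrupt
source objects instead.

```python
from typing import NamedTuple


class MpIntEntry(NamedTuple):
    """Hypothetical, simplified MP Table interrupt entry."""
    bus: int
    slot: int
    pin: str       # "INTA".."INTD"
    apic_id: int
    apic_pin: int


def mptable_route(entries, apic_base_irq, bus, slot, pin):
    """Map a PCI (bus, slot, pin) tuple to an IRQ cookie.

    apic_base_irq maps an APIC id to the IRQ of that APIC's pin 0.
    Returns None when the table has no matching entry.
    """
    for e in entries:
        if (e.bus, e.slot, e.pin) == (bus, slot, pin):
            return apic_base_irq[e.apic_id] + e.apic_pin
    return None


# Made-up sample data: two 24-pin I/O APICs with ids 2 and 3.
SAMPLE_ENTRIES = [
    MpIntEntry(0, 4, "INTA", 2, 16),
    MpIntEntry(2, 1, "INTB", 3, 5),
]
SAMPLE_APIC_BASE = {2: 0, 3: 24}
```

Since the wiring is fixed in the motherboard design, the lookup is
read-only: nothing here programs any hardware.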

For the non-APIC case (non-ACPI still), PCI device interrupts are  
usually wired up to a pin on a programmable interrupt router.  Each  
of these pins is called a pci link device.  Multiple PCI device  
interrupts may be wired up to the same link device, and systems  
typically have anywhere from 4 to 8 (sometimes even more) link  
devices.  Each link device can be independently routed to a given  
(pic, pin) and it is limited to a fixed set of possible IRQs.  If  
multiple link devices are routed to the same IRQ, then all of the  
devices attached to both link devices end up sharing the same IRQ  
(and thus the same ithread, etc.).  Because the link devices are  
independently steerable, this is the one way in which the OS has  
limited flexibility in routing interrupts.  However, the way it works  
is that you route the link devices, not individual PCI device  
interrupts.  The table the BIOS provides with the information about  
the link devices is called the $PIR (since that's the 4 byte  
signature you search for in RAM to find it).  You can see it during a  
verbose boot dmesg.  It is a table that maps a given (bus, slot,  
intpin) PCI tuple to a link index.  Each entry also has a bitmask of  
the valid ISA IRQs ($PIR only allows for the 16 ISA IRQs) that the  
specified link index can be routed to.  Thus, the way that interrupt  
routing works with $PIR is that when a PCI interrupt is routed, you  
search the table for a matching entry to get a link index.  The $PIR  
code in sys/i386/pci/pci_pir.c basically has a list of link objects  
that maintain state about each link.  The code finds the data  
associated with the link index and sees if it has an IRQ routed  
already.  If so, that's the IRQ that that PCI device interrupt is  
assigned to.  If an IRQ isn't routed already, then it has to use an  
algorithm to pick one, make a BIOS call to route the link to the  
chosen IRQ, and then assign the PCI device interrupt to that IRQ.
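
The $PIR flow can be modelled roughly like this.  It is a sketch and
not the code in pci_pir.c: the names are invented, the BIOS call is a
stub, and "pick the lowest allowed IRQ" is only a placeholder for the
real selection algorithm, which is not described here.

```python
class PirLink:
    """State kept for one link: its allowed ISA IRQs and current routing."""

    def __init__(self, index, irq_mask, irq=None):
        self.index = index
        self.irq_mask = irq_mask  # bit n set => ISA IRQ n is allowed
        self.irq = irq            # None until the link has been routed


def bios_route_link(link, irq):
    """Stand-in for the BIOS call that programs the interrupt router."""
    link.irq = irq


def pir_route(table, links, bus, slot, pin):
    """Route a PCI (bus, slot, pin) tuple via $PIR and return its IRQ.

    table maps (bus, slot, pin) to a link index; links maps a link
    index to its PirLink.  Returns None if nothing can be routed.
    """
    link_index = table.get((bus, slot, pin))
    if link_index is None:
        return None
    link = links[link_index]
    if link.irq is None:
        allowed = [n for n in range(16) if link.irq_mask & (1 << n)]
        if not allowed:
            return None
        # ASSUMPTION: placeholder policy, lowest allowed IRQ wins.
        bios_route_link(link, allowed[0])
    return link.irq
```

Note that the routing decision is made per link: two PCI device
interrupts that sit behind the same link always come back with the
same IRQ.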

Now that you understand that, ACPI routing can make some sense.  The  
way that ACPI routing works is that each PCI bus in the ACPI  
namespace has a _PRT method that returns a table of routing entries.   
Each entry contains the slot and intpin that it handles, a reference
to a link device in the ACPI namespace, and a source index.  The slot
and intpin let you build the (bus, slot, intpin) PCI tuple; the bus
comes from the PCI bus device that the _PRT is a child of.  (In
FreeBSD the _PRT is actually a child of the pcib(4) device that is
the bridge that is the parent of the PCI bus, but I digress.)  If the
link device
reference is empty or NULL, then the interrupt is a hard-wired  
interrupt such as the ones used with MP Table routing, and the source  
index is the global interrupt number (==IRQ) that you use for this  
interrupt and you are done.  If the link device reference isn't  
empty, then it is the name of an ACPI device object that manages a
single pci link device.  Example names include \_SB_.PCI0.LPC0.LNKA.   
Each link device object includes methods to query which IRQ it is  
currently routed to (though in practice this is unreliable), get the  
list of possible IRQs, disable the link device altogether, and route  
the link device to a specified IRQ.  This is similar to the link  
objects we have in the $PIR code except that these end up being full  
blown devices on the ACPI side.  ACPI adds another twist in that the  
BIOS is free to use link devices with APICs (MP Table has no way of  
handling that), and in fact in practice there are some nvidia  
chipsets for amd64 that do route some PCI device interrupts to link  
devices that in APIC mode can be routed to any of the IRQs 20-23.
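
The two _PRT cases can be sketched as below.  This is a model, not
the ACPI code: the class and field names are made up, the link
tracks its own routed IRQ (since asking the link for its current IRQ
is unreliable in practice), and routing an unrouted link to the first
IRQ in its list is just a placeholder policy.

```python
from typing import NamedTuple, Optional


class PrtEntry(NamedTuple):
    """Hypothetical, simplified _PRT routing entry."""
    slot: int
    pin: str                # "INTA".."INTD"
    link: Optional[str]     # ACPI link device name, or None if hard-wired
    source_index: int       # global interrupt number when link is None


class AcpiLink:
    """Model of a link device such as \\_SB_.PCI0.LPC0.LNKA."""

    def __init__(self, name, possible_irqs):
        self.name = name
        self.possible_irqs = list(possible_irqs)
        self.irq = None  # tracked here rather than queried from the BIOS

    def route(self, irq):
        if irq not in self.possible_irqs:
            raise ValueError("%s cannot use IRQ %d" % (self.name, irq))
        self.irq = irq


def prt_route(prt, links, slot, pin):
    """Return the IRQ for (slot, pin) on the bus this _PRT belongs to."""
    for entry in prt:
        if (entry.slot, entry.pin) != (slot, pin):
            continue
        if not entry.link:
            # Hard-wired: the source index is the global interrupt number.
            return entry.source_index
        link = links[entry.link]
        if link.irq is None:
            # ASSUMPTION: placeholder policy, take the first possible IRQ.
            link.route(link.possible_irqs[0])
        return link.irq
    return None
```

The hard-wired branch corresponds to the MP Table style of routing,
and the link branch to the $PIR style, except that here a link may
offer APIC-only IRQs such as 20-23.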

Now some of the minor trivia and exceptions.  On the first I/O APIC,  
IRQ0 is generally routed to intpin 2, not intpin 0 (though many  
motherboards don't actually hook up the IRQ0 output from the ISA  
timer to intpin 2 but do claim to do so in the MP Table and MADT).   
Instead intpin 0 is a special ExtINT pin that listens to the 8259As  
and can forward interrupts from the 8259As to one or more CPUs.  This  
is what "mixed mode" is, and on FreeBSD 4.x, if we discover via a  
test that the motherboard did not wire IRQ0 up to intpin 2, we use  
mixed mode to deliver it via the 8259A bounced through the ExtINT pin  
0 on the first I/O APIC.  Blech.  Also, for ACPI, the SCI is  
generally tied to IRQ 9, however, the SCI may be routed to another  
intpin in APIC mode.  Rather than change the IRQ value in the FADT  
(or whichever table the SCI INT is in), ACPI will include an entry in  
the MADT that maps IRQ 9 to some other intpin such as IRQ 13 or IRQ  
20.  If the new intpin is not an ISA IRQ (> 15) we use a backdoor to  
override the IRQ ACPI uses.  If the new intpin is an ISA IRQ though,  
we actually rename the destination IRQ (such as IRQ 13 on one of my  
boxes) to IRQ 9, and the original IRQ 9 becomes a "dead" interrupt  
pin with no IRQ associated with it.  Note that except for a few rare  
and very old SMP boxes, no FreeBSD x86 machine has an IRQ 2.  Another  
odd case is that some very old SMP boxes did not route PCI device  
interrupts to the APICs at all.  Instead, they routed the outputs of  
the link devices to the pins on the first I/O APIC corresponding to  
the same IRQ as on the 8259A (the I/O APIC only had 16 pins).  Thus,  
on these boxes, PCI interrupts are still routed via link devices via  
$PIR, and end up triggering IRQ X via intpin X on the first I/O  
APIC.  One final twist.  If a PCI bus behind a PCI-PCI bridge is not  
listed in a BIOS table ($PIR or MP Table) or does not have a _PRT in  
ACPI, the interrupts are routed by applying the swizzle defined in  
the PCI standard to route the interrupt via one of the four INTx pins  
on the PCI-PCI bridge's parent PCI bus.  The standard defines this  
behavior for add-in cards, but some built in busses do this as well.   
(I've seen several AGP busses that actually use this method to route  
the VGA IRQ).
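
For reference, the swizzle itself is a one-liner.  The sketch below
uses the 1-4 intpin encoding from the config register; the function
name is made up.

```python
def swizzle(slot, intpin):
    """Return the INTx pin (1-4) on the bridge's parent bus that a
    child device's interrupt arrives on.

    slot is the device's slot number on the bus behind the PCI-PCI
    bridge; intpin is 1-4 for INTA-INTD.
    """
    if not 1 <= intpin <= 4:
        raise ValueError("intpin must be 1..4 (INTA..INTD)")
    return ((slot + (intpin - 1)) % 4) + 1
```

So INTA of slot 0 stays INTA, INTA of slot 1 becomes INTB, and so
on, wrapping around every four slots.  The resulting pin is then
routed as an interrupt of the bridge itself on its parent bus.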

> Also, if the "boot interrupt" was previously set to 2, is that  
> likely to have changed in -current?
> Am I now going to get clobbered on IRQ16?  If yes, is this
> something that the BIOS writers decided, or something that the
> Motherboard designers decided?

The "boot interrupt" issue on some of the PXH's used for PCI-X and  
PCI-e host bridges is an unpleasant mess.  I think it comes from  
Intel assuming all the world is Windows (imagine that) and ignoring
standards (such as MP Table and ACPI) that it helps to author.  (Yay  
Intel!)  The issue there is that the PXH's include a dedicated I/O  
APIC for each of the two busses the PXH serves, and the PCI device  
interrupts are routed to intpins on those APICs.  To handle the non- 
APIC case, the PXH's forward any device interrupts to the INTx pins  
on the parent side of the PCI-PCI bridge if the APIC is disabled.   
The problem is that Intel chose a hack to figure out if the APIC was  
disabled and that hack interacts badly with FreeBSD.  Basically, if  
the individual intpin is masked in the APIC, the PXH assumes you  
aren't using the APIC to handle interrupts and so it forwards the  
interrupt to the INTx pin on its bridge's parent side.  The problem  
is that after an interrupt comes in on 4.x and later, we mask the  
interrupt in the APIC until we have run the interrupt handler.  The  
reason is that PCI interrupts are level triggered, so they won't  
"shut up" until the ISR has run and pacified the PCI device.  4.x  
masks the interrupt because it wants to not run ISRs with all  
interrupts disabled, but at the same cpl that the interrupt was  
registered at so that higher priority interrupts can still preempt an  
ISR.  5.x and later need to mask the interrupt so that the processor  
doesn't have to keep interrupts disabled until the ithread finishes.   
Trying to do that would become complicated and quite painful since it  
would also mean deferring the EOI to the lapic (which has to happen  
on the same CPU that received the interrupt) and has other nastiness  
since ithreads can block on locks, etc.  Other OS's that use ithreads  
such as BSD/OS and probably Solaris/x86 and Darwin/x86 probably have  
the same issue.  The sucky part is that Intel didn't have to do this  
gross hack.  ACPI requires that the OS call a method _PIC if it wants  
to use APIC mode, and the _PIC method is free to write to registers,  
PCI config space, etc., so Intel could have provided a register to  
specify if the PXH's APIC was being used or not and included the code  
to manage that in _PIC in their sample BIOS.  But, they didn't.
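
A toy model of the interaction, purely to show the sequence of
events.  Nothing here reflects actual PXH registers or FreeBSD code,
and the names are invented.

```python
class PxhPin:
    """One intpin on a PXH I/O APIC, reduced to its mask bit."""

    def __init__(self, masked):
        self.masked = masked

    def deliver(self):
        """Where the PXH sends a device interrupt that is asserted now."""
        # The PXH treats "pin masked" as "OS is not using this APIC".
        return "parent INTx" if self.masked else "PXH I/O APIC"


def level_interrupt_trace(pin):
    """Follow one level-triggered interrupt through the OS's handling."""
    trace = []
    pin.masked = False
    trace.append(pin.deliver())  # first delivery goes through the APIC
    pin.masked = True            # OS masks the pin until the ISR has run
    trace.append(pin.deliver())  # device still asserts: aliased to INTx
    pin.masked = False           # ISR has quieted the device; unmask
    return trace
```

The second entry in the trace is the "boot interrupt": the same
device interrupt showing up on whatever IRQ the parent-side INTx pin
is routed to.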

One possible workaround for this issue would be to provide a hacked  
PCI-PCI bridge driver for the PXH's that hacked the PCI interrupt  
routing such that the PCI device interrupts for child devices didn't  
use the APICs in the PXH at all, but used the IRQs that they get aliased
to (such as IRQ16 on 5.2+).  Getting that to work on 4.x might be  
quite painful since 4.x PCI interrupt routing code is rather gross  
and hacky already.

Hopefully this at least answers some questions and gives a good  
overview of what PCI interrupt routing is and how it works, etc.

-- 
John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve"  =  http://www.FreeBSD.org