From: John Baldwin
Date: Wed, 7 Dec 2005 00:40:19 -0500
To: Julian Elischer
Cc: current@FreeBSD.org
Subject: Re: can someone explain...[ PCI interrupts]

On Dec 6, 2005, at 5:57 PM, Julian Elischer wrote:

> In short
> words for the likes of me,
> Can someone give a quick roundup on PCI routing in 4.x and -current.

My, what a set of questions. :) I'll do my best, but this will probably be a long and perhaps wandering e-mail.

First off, interrupts for PCI devices are roughly split up into two categories (currently): INTx interrupt lines and MSI interrupts. MSI is relatively new and I won't cover it much here. No versions of FreeBSD currently support MSI either (though it's on my todo list), so I'll limit this discussion to INTx interrupts.

For INTx interrupts, each PCI device (or slot) has 4 interrupt lines: INTA, INTB, INTC, and INTD. Thus, you can describe any individual PCI interrupt as a tuple of (bus, slot, pin). For example, device 4's INTA pin on pci bus 0 would be (0, 4, INTA). Each PCI function is allowed to have one INTx interrupt. The bus and slot come from the location of that function in the PCI hierarchy, and the pin comes from the intpin PCI config register.

Beyond the INTx pin, PCI doesn't define how an interrupt is delivered to the CPU, etc. That is all a property of the architecture, chipset, etc. On x86, there are two disparate sets of hardware for managing interrupt signals. The first is the pair of 8259A interrupt controllers found on all PC-AT compatible machines. The second set of hardware is the APIC subsystem as it were. Each processor contains a local APIC that can receive messages from other APICs and send messages to other local APICs. In addition to the local APICs, the chipset contains 1 or more I/O APICs. Each I/O APIC contains anywhere from 4 to 32 individual interrupt pins. Common numbers are 4 (somewhat rare), 16, 24, and 32. Conceptually, on x86 a given interrupt source can be described by the tuple (pic, pin).

Simply put, PCI interrupt routing is the mapping of (bus, slot, pin) PCI interrupt tuples to (pic, pin) x86 interrupt tuples.

Now, before delving deeper into the specifics of routing on x86, let me digress about IRQs on FreeBSD.
Basically, an IRQ value is a cookie useful for binding a device interrupt (such as a PCI (bus, slot, pin) tuple or an ISA IRQ) to an x86 interrupt tuple (pic, pin). BIOSes don't operate with APICs at all, at least not for handling device interrupts. Thus, they all use a simple mapping where IRQs 0-7 correspond to pins 0-7 on the master 8259A, and IRQs 8-15 map to pins 0-7 on the slave 8259A. All versions of FreeBSD use the same mapping for IRQ cookie values when using the 8259As to route interrupts.

For the APIC case the mapping of IRQ cookies to (pic, pin) tuples is slightly more complicated. First, the simple case. FreeBSD 5.2 and later follow the ACPI model (even when not using ACPI) where the IRQs 0-n correspond to the pins 0-n of the first I/O APIC, IRQs n+1 to (n+1)+m map to pins 0 to m of the second I/O APIC, etc. (There is one possible exception with ACPI I'll cover later.)

FreeBSD 4.x is more complicated. The reason is that due to cpl and spl interrupt masks being 32-bit integers with 8 bits set aside for software interrupts (SWIs), cpl only has 24 bits available for hardware interrupts. Therefore, FreeBSD before 5.2 is limited to IRQ values 0-23 and can't use the simple (and intuitive) model that FreeBSD 5.2+ and ACPI use. What FreeBSD 4.x does is to map the ISA interrupts attached to the first I/O APIC to IRQs 0, 1, and 3-15. This just leaves IRQs 2 and 16-23 available for all the other APIC interrupt pins. As each PCI device registers an interrupt handler for a specific (apic, pin) tuple, that x86 interrupt is mapped to one of the last set of IRQs. If all of them have been used already, then the kernel starts assigning multiple (apic, pin) tuples to the same IRQ, resulting in interrupts being shared in software because of the cpl limitation even though they aren't shared in hardware. This is why your IRQ values are different on 4.x than on FreeBSD 5.2+ and Linux, which use the ACPI global interrupt number model.
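
To make the two simple IRQ cookie mappings above concrete, here is a rough sketch in Python (not kernel code). The function names and the list of per-APIC pin counts are made up for illustration, and the sketch ignores the exceptions (IRQ0, the SCI) as well as the 4.x scheme entirely.

```python
from typing import List, Tuple


def irq_to_8259a(irq: int) -> Tuple[str, int]:
    """Map an IRQ cookie (0-15) to a (pic, pin) tuple on the 8259A pair.

    IRQs 0-7 are pins 0-7 on the master, IRQs 8-15 are pins 0-7 on the slave.
    """
    if not 0 <= irq <= 15:
        raise ValueError("the 8259A pair only covers IRQs 0-15")
    if irq < 8:
        return ("master", irq)
    return ("slave", irq - 8)


def irq_to_ioapic(irq: int, pins_per_apic: List[int]) -> Tuple[int, int]:
    """Map a global IRQ number to an (apic index, pin) tuple.

    pins_per_apic is a hypothetical list holding the number of pins on each
    I/O APIC, in order. IRQs are handed out consecutively: the first APIC
    gets IRQs 0..n, the second picks up at n+1, and so on.
    """
    if irq < 0:
        raise ValueError("IRQ numbers are non-negative")
    base = 0
    for index, pins in enumerate(pins_per_apic):
        if irq < base + pins:
            return (index, irq - base)
        base += pins
    raise ValueError("IRQ is beyond the last I/O APIC pin")
```

For example, with two 24-pin I/O APICs, irq_to_ioapic(24, [24, 24]) gives (1, 0): the first pin of the second APIC.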

Now, back to how routing of PCI device interrupts on x86 actually works. I'll cover non-ACPI first. There are two cases to consider.

First, the easy case is that a PCI device interrupt (bus, slot, pin) is wired directly to an individual pin on a pic. This is often how interrupts are wired when using APICs. If you look at the mptable output and look at the interrupt section, this is fairly obvious as you will see entries that map the interrupt for a given pci bus, slot and pin to a given apic id and intpin on that apic. Thus, there is the mapping for (bus, slot, pin) to (pic, pin) directly. The way interrupt routing is implemented in this case is that when we go to route an interrupt for a given PCI device, we search the mptable for a matching entry. We then look up the associated apic via its apic id, ask it for the specified pin, and then ask that pin for its IRQ (via the pic_vector method of the ioapic interrupt source object that describes the specific pin). When nexus(4) does bus_setup_intr(), it passes that IRQ to the x86 intr_machdep code which uses the IRQ as an index into its interrupt source array and ends up with the interrupt source object for the (apic, pin) tuple being used. (Thus, IRQs are just a cookie that is the index into the global array of interrupt sources on x86.) Note that interrupts routed this way are hardwired into the motherboard design. There's no chance for the OS to change which (pic, pin) a PCI device interrupt is hooked up to.

For the non-APIC case (non-ACPI still), PCI device interrupts are usually wired up to a pin on a programmable interrupt router. Each of these pins is called a pci link device. Multiple PCI device interrupts may be wired up to the same link device, and systems typically have anywhere from 4 to 8 (sometimes even more) link devices. Each link device can be independently routed to a given (pic, pin) and it is limited to a fixed set of possible IRQs.
If multiple link devices are routed to the same IRQ, then all of the devices attached to both link devices end up sharing the same IRQ (and thus the same ithread, etc.). Because the link devices are independently steerable, this is the one way in which the OS has limited flexibility in routing interrupts. However, the way it works is that you route the link devices, not individual PCI device interrupts.

The table the BIOS provides with the information about the link devices is called the $PIR (since that's the 4 byte signature you search for in RAM to find it). You can see it during a verbose boot dmesg. It is a table that maps a given (bus, slot, intpin) PCI tuple to a link index. Each entry also has a bitmask of the valid ISA IRQs ($PIR only allows for the 16 ISA IRQs) that the specified link index can be routed to. Thus, the way that interrupt routing works with $PIR is that when a PCI interrupt is routed, you search the table for a matching entry to get a link index. The $PIR code in sys/i386/pci/pci_pir.c basically has a list of link objects that maintain state about each link. The code finds the data associated with the link index and sees if it has an IRQ routed already. If so, that's the IRQ that that PCI device interrupt is assigned to. If an IRQ isn't routed already, then it has to use an algorithm to pick one, make a BIOS call to route the link to the chosen IRQ, and then assign the PCI device interrupt to that IRQ.

Now that you understand that, ACPI routing can make some sense. The way that ACPI routing works is that each PCI bus in the ACPI namespace has a _PRT method that returns a table of routing entries.
Each entry contains the slot and intpin that it handles, so that you can build the (bus, slot, intpin) PCI tuple, as well as a reference to a link device in the ACPI namespace and a source index. (The bus comes from the PCI bus device the _PRT is a child of. In FreeBSD the _PRT is actually a child of the pcib(4) device that is the bridge that is the parent of the PCI bus, but I digress.) If the link device reference is empty or NULL, then the interrupt is a hard-wired interrupt such as the ones used with MP Table routing, and the source index is the global interrupt number (==IRQ) that you use for this interrupt and you are done. If the link device reference isn't empty, then it is the name of an ACPI device object that manages a single pci link device. Example names include \_SB_.PCI0.LPC0.LNKA. Each link device object includes methods to query which IRQ it is currently routed to (though in practice this is unreliable), get the list of possible IRQs, disable the link device altogether, and route the link device to a specified IRQ. This is similar to the link objects we have in the $PIR code except that these end up being full blown devices on the ACPI side. ACPI adds another twist in that the BIOS is free to use link devices with APICs (MP Table has no way of handling that), and in fact in practice there are some nvidia chipsets for amd64 that do route some PCI device interrupts to link devices that in APIC mode can be routed to any of the IRQs 20-23.

Now some of the minor trivia and exceptions. On the first I/O APIC, IRQ0 is generally routed to intpin 2, not intpin 0 (though many motherboards don't actually hook up the IRQ0 output from the ISA timer to intpin 2 but do claim to do so in the MP Table and MADT). Instead intpin 0 is a special ExtINT pin that listens to the 8259As and can forward interrupts from the 8259As to one or more CPUs.
This is what "mixed mode" is, and on FreeBSD 4.x, if we discover via a test that the motherboard did not wire IRQ0 up to intpin 2, we use mixed mode to deliver it via the 8259A bounced through the ExtINT pin 0 on the first I/O APIC. Blech.

Also, for ACPI, the SCI is generally tied to IRQ 9; however, the SCI may be routed to another intpin in APIC mode. Rather than change the IRQ value in the FADT (or whichever table the SCI INT is in), ACPI will include an entry in the MADT that maps IRQ 9 to some other intpin such as IRQ 13 or IRQ 20. If the new intpin is not an ISA IRQ (> 15) we use a backdoor to override the IRQ ACPI uses. If the new intpin is an ISA IRQ though, we actually rename the destination IRQ (such as IRQ 13 on one of my boxes) to IRQ 9, and the original IRQ 9 becomes a "dead" interrupt pin with no IRQ associated with it. Note that except for a few rare and very old SMP boxes, no FreeBSD x86 machine has an IRQ 2.

Another odd case is that some very old SMP boxes did not route PCI device interrupts to the APICs at all. Instead, they routed the outputs of the link devices to the pins on the first I/O APIC corresponding to the same IRQ as on the 8259A (the I/O APIC only had 16 pins). Thus, on these boxes, PCI interrupts are still routed via link devices via $PIR, and end up triggering IRQ X via intpin X on the first I/O APIC.

One final twist. If a PCI bus behind a PCI-PCI bridge is not listed in a BIOS table ($PIR or MP Table) or does not have a _PRT in ACPI, the interrupts are routed by applying the swizzle defined in the PCI standard to route the interrupt via one of the four INTx pins on the PCI-PCI bridge's parent PCI bus. The standard defines this behavior for add-in cards, but some built in busses do this as well. (I've seen several AGP busses that actually use this method to route the VGA IRQ).

> Also, if the "boot interrupt" was previously set to 2, is that
> likely to have changed in -current?
> Am I now going to get clobbered on IRQ16?
> If yes, is this something that the BIOS writers decided, or
> something that the Motherboard designers decided?

The "boot interrupt" issue on some of the PXH's used for PCI-X and PCI-e host bridges is an unpleasant mess. I think it comes from Intel assuming all the world is Windows (imagine that) and ignoring standards (such as MP Table and ACPI) that it helps to author. (Yay Intel!) The issue there is that the PXH's include a dedicated I/O APIC for each of the two busses the PXH serves, and the PCI device interrupts are routed to intpins on those APICs. To handle the non-APIC case, the PXH's forward any device interrupts to the INTx pins on the parent side of the PCI-PCI bridge if the APIC is disabled. The problem is that Intel chose a hack to figure out if the APIC was disabled and that hack interacts badly with FreeBSD. Basically, if the individual intpin is masked in the APIC, the PXH assumes you aren't using the APIC to handle interrupts and so it forwards the interrupt to the INTx pin on its bridge's parent side.

The problem is that after an interrupt comes in on 4.x and later, we mask the interrupt in the APIC until we have run the interrupt handler. The reason is that PCI interrupts are level triggered, so they won't "shut up" until the ISR has run and pacified the PCI device. 4.x masks the interrupt because it doesn't want to run ISRs with all interrupts disabled, but rather at the same cpl that the interrupt was registered at, so that higher priority interrupts can still preempt an ISR. 5.x and later need to mask the interrupt so that the processor doesn't have to keep interrupts disabled until the ithread finishes. Trying to do that would become complicated and quite painful since it would also mean deferring the EOI to the lapic (which has to happen on the same CPU that received the interrupt) and has other nastiness since ithreads can block on locks, etc.
Other OS's that use ithreads such as BSD/OS and probably Solaris/x86 and Darwin/x86 probably have the same issue.

The sucky part is that Intel didn't have to do this gross hack. ACPI requires that the OS call a method _PIC if it wants to use APIC mode, and the _PIC method is free to write to registers, PCI config space, etc., so Intel could have provided a register to specify if the PXH's APIC was being used or not and included the code to manage that in _PIC in their sample BIOS. But, they didn't.

One possible workaround for this issue would be to provide a hacked PCI-PCI bridge driver for the PXH's that hacked the PCI interrupt routing such that the PCI device interrupts for child devices didn't use the APICs in the PXH at all, but used the IRQs that get aliased to (such as IRQ16 on 5.2+). Getting that to work on 4.x might be quite painful since 4.x PCI interrupt routing code is rather gross and hacky already.

Hopefully this at least answers some questions and gives a good overview of what PCI interrupt routing is and how it works, etc.

-- 
John Baldwin <>< http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve" = http://www.FreeBSD.org