From: Terry Lambert
Date: Sat, 21 Sep 2002 11:46:21 -0700
To: beemern
Cc: smp@freebsd.org, jhb@freebsd.org
Subject: Re: For those with P4 SMP problems..

beemern wrote:
> i'm preparing to start in on Terry Lambert's suggestion, however,
> perhaps you (anyone) could clear up a few minor questions..
>
> -he says other systems are "matching CPUs started at the time of the
> check"
> ..matching them with what?

The theory is that the BIOS has the correct information, but in the
wrong order, and FreeBSD cares about the order, but Linux and Windows
do not, because they've performed an additional optimization that lets
them start the APs simultaneously, and a side effect of this is that
they don't care about order of start, they just care *that* they start.

    /*
     * start each AP in our list
     */
    static int
    start_all_aps(u_int boot_addr)
    {
        ...
        /* start each AP */
        for (x = 1; x <= mp_naps; ++x) {
            ...
            bootSTK = &SMP_prvspace[x].idlestack[UPAGES*PAGE_SIZE];
            bootAP = x;

            /* attempt to start the Application Processor */
            ...
            if (!start_ap(x, boot_addr)) {
                ...
            }
            ...
            /* record its version info */
            cpu_apic_versions[x] = cpu_apic_versions[0];

            all_cpus |= (1 << x);    /* record AP in CPU map */
        }
        ...

    /*
     * this function starts the AP (application processor) identified
     * by the APIC ID 'physicalCpu'.  It does quite a "song and dance"
     * to accomplish this.  This is necessary because of the nuances
     * of the different hardware we might encounter.  It ain't pretty,
     * but it seems to work.
     */
    static int
    start_ap(int logical_cpu, u_int boot_addr)
    {
        ...
        /* get the PHYSICAL APIC ID# */
        physical_cpu = CPU_TO_ID(logical_cpu);
        ...
    }

...basically, what is happening here is that there is an iteration
through all of the logical CPUs, and each logical CPU number is then
used to start the corresponding physical CPU.

If they are all started at the same time, and the results are collected
before they are compared, as in Linux or Windows NT, then the result is
that they all start.

If the starts are attempted serially, and the results are also
collected serially, then the result is that they do not start.

The implication here is clear: serial start fails because the APIC ID
the BIOS claims is assigned to each CPU is not the APIC ID which was
actually assigned to the CPU.  But the concurrent startup works, because
the set of IDs known to the BIOS matches the set of IDs assigned to the
CPUs.  So you get the right answer to the "has this started?" question,
but you don't get the answer from the physical CPU you expected.

The serial start depends on getting the correct answer from the CPU you
expected, rather than just getting the correct answer and not caring
about the man behind the curtain.

Probably, the canonically correct thing to do would be to start each
CPU with code that reassigns its real APIC ID into the logical APIC ID,
so that there is no longer a mismatch.
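
To make the difference concrete, here is a small user-space model in C.
It is not kernel code: the function names serial_matches() and
concurrent_matches() and the ID arrays they take are made up for
illustration. It assumes the theory above holds, i.e. the BIOS table
and the CPUs carry the same set of APIC IDs, just in a different order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_APIC_ID 32  /* assumption: 5-bit local APIC ID space */

/*
 * Serial check (hypothetical model): slot i passes only if the CPU
 * that answered really has the APIC ID the BIOS table lists for
 * slot i.  Returns the number of slots that matched.
 */
static size_t
serial_matches(const uint8_t *bios_ids, const uint8_t *real_ids, size_t n)
{
    size_t ok = 0;

    for (size_t i = 0; i < n; i++)
        if (bios_ids[i] == real_ids[i])
            ok++;
    return ok;
}

/*
 * Concurrent check (hypothetical model): start everything, collect
 * the IDs that answered, then compare the two sets as bitmasks.
 * The order of the entries no longer matters.
 */
static bool
concurrent_matches(const uint8_t *bios_ids, const uint8_t *real_ids, size_t n)
{
    uint32_t expected = 0, started = 0;

    for (size_t i = 0; i < n; i++) {
        assert(bios_ids[i] < MAX_APIC_ID && real_ids[i] < MAX_APIC_ID);
        expected |= UINT32_C(1) << bios_ids[i];
        started  |= UINT32_C(1) << real_ids[i];
    }
    return expected == started;
}
```

With an invented BIOS table of {0, 1, 6, 7} and CPUs that actually
answer as {0, 6, 1, 7}, the per-slot comparison matches only two of the
four entries, while the set comparison succeeds.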

> -also, shouldn't our whole exercise of exhaustively hardcoding the apic
> for cpu1 from 1 to 11 have found out which one was the REAL one?

Not really.  If you look at the code, there's a bunch of coupled
information.  By serially attempting to start it, you assume not only
that the APIC ID that the BIOS erroneously believes to be correct is
used, but that the associated stack and other information is also known
to the processor.

Basically, you aren't going to be able to safely do about 4 things in
the same order, and expect them to work.  The start_all_aps() code needs
to be refactored, and the start_ap() code needs to be broken into
between 3 and 5 parts (depending on how you handle making the APs
correspond to the logical APs), and unrolled so that it can be done
concurrently, instead of depending on serial success.

> -finally, it appears Mr. Lambert is suggesting 2 mutually exclusive
> solutions (correct?) ..where the second one ("For extra points...") looks
> like the more complete and "right" solution, however, as noted in the
> previous question, shouldn't we have hit upon the correct id already by
> playing with the physical_cpu and CPU_TO_ID() as i and Mr. Feldkamp have
> been?
>
> thanks for any further input/direction you can give.. i'm gonna poke
> around in the src and find where the cpu->apic assignments are made
> originally and just see what i can see

You should be able to start everything up, not caring about the logical
vs. physical APIC ID mapping, as long as you start all the CPUs.

What will break, however, is the case where the BIOS doesn't simply
contain the right physical APIC IDs, out of order, or where you need to
send a targeted IPI instead of a broadcast IPI.

So the two "solutions" boil down to correcting the physical/logical
mapping, or reloading the physical APIC ID register.  Either one works,
but reloading the register lets you get rid of the logical and physical
indirection (assuming you shove the I/O APIC off to ID 31, the last ID).
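
As a rough illustration of the first option (correcting the mapping),
here is a user-space sketch in C.  Everything in it is hypothetical:
struct cpu_map and the cpu_map_*() functions do not exist in the
FreeBSD source, and the sketch assumes each CPU can report its own APIC
ID once it is running.  The idea is to build the logical-to-physical
table from what the CPUs report, in arrival order, rather than from the
order of the BIOS table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CPUS 32     /* assumption: 5-bit APIC ID space */
#define NO_CPU   (-1)

/* Hypothetical bookkeeping; not a structure from the kernel. */
struct cpu_map {
    uint8_t cpu_to_id[MAX_CPUS];  /* logical CPU -> physical APIC ID */
    int     id_to_cpu[MAX_CPUS];  /* physical APIC ID -> logical CPU */
    size_t  ncpus;
};

static void
cpu_map_init(struct cpu_map *m)
{
    m->ncpus = 0;
    for (size_t i = 0; i < MAX_CPUS; i++) {
        m->cpu_to_id[i] = 0;
        m->id_to_cpu[i] = NO_CPU;
    }
}

/*
 * Record one CPU using the APIC ID that the CPU itself reported.
 * Logical numbers are handed out in arrival order.  Returns the
 * logical number, or NO_CPU for a duplicate or out-of-range ID.
 */
static int
cpu_map_register(struct cpu_map *m, uint8_t reported_id)
{
    if (reported_id >= MAX_CPUS || m->ncpus >= MAX_CPUS)
        return NO_CPU;
    if (m->id_to_cpu[reported_id] != NO_CPU)
        return NO_CPU;

    int logical = (int)m->ncpus++;
    m->cpu_to_id[logical] = reported_id;
    m->id_to_cpu[reported_id] = logical;
    return logical;
}

/* Physical APIC ID to address for a targeted IPI to `logical`. */
static uint8_t
cpu_map_ipi_target(const struct cpu_map *m, int logical)
{
    assert(logical >= 0 && (size_t)logical < m->ncpus);
    return m->cpu_to_id[logical];
}
```

If CPUs report IDs 0, 6 and 1 in that order, logical CPU 1 ends up
mapped to physical ID 6, so a targeted IPI for logical CPU 1 goes to the
CPU that actually holds that ID, whatever order the BIOS listed them in.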

How correct, and when correct, the mapping has to be really depends on
how often the logical to physical translation happens, in order to
explicitly signal a CPU.  I'd have to read all the -current code in
considerable detail to answer that question, or just punt, and come up
with a fix where the answer to the question ends up not mattering.
That's the way I prefer... 8-).

Rewriting the APIC ID in each auxiliary CPU is a pain in the neck; the
BIOS does it by holding 5 bits worth of pins on each CPU to a specific
value.  You can do it, in theory: the code should not need the BIOS to
do the assignment to function, if you don't care about not starting
some CPUs, or starting particular ones... that gets around all the
normal BIOS bugs related to CPU detection, but it's a much harder
problem to solve, since you have to have a free APIC ID to let you
shuffle things around (hence the extra points ;^)).

-- Terry