From: Terry Lambert
Date: Sat, 12 Jan 2002 01:28:03 -0800
To: Bruce Evans
Cc: Peter Wemm, Alfred Perlstein, Kelly Yancey, Nate Williams,
    Daniel Eischen, Dan Eischen, Archie Cobbs, arch@FreeBSD.ORG
Subject: Re: Request for review: getcontext, setcontext, etc
Message-ID: <3C4001A3.5ECCAEB9@mindspring.com>
References: <20020112152622.W4598-100000@gamplex.bde.org>

Bruce Evans wrote:
> (*) It may not be all that good. It was good on old machines when 108
> bytes was a lot of memory and moving the state in and out of the FPU
> was slow too. It is possible that the logic to avoid doing the switch
> takes longer than always doing it, but not all that likely because logic
> speed is increasing faster than memory speed and new machines have more
> state to save (512 (?) bytes for SSE).

Correct me if my math is wrong, but let's run with this...

If I have a 2GHz CPU and 133MHz memory, then we are talking a 16:1
slowdown for a transfer of 512 bytes from register to L1 to L2 to
main memory for an FPU state spill.
Assuming a 64-bit data path, we are talking a minimum of
3 * 512/(64/8) * 16 = 3072 clocks (3k) to save the damn FPU state
off to main memory (a store in a loop is 3 clocks, ignoring the
setup and crap, right?). Add another 3k clocks to bring it back.

Best case, God loves us, and we spill and restore from L1 without
an IPI or an invalidation, and without starting the thread on a CPU
other than the one where it was suspended, and all spills are to
cacheable write-through pages. That's a 16 times speed increase
because we get to ignore the bus speed differential, or
3 * 512/(64/8) * 2 = (6k/16) = 384 clocks.

So it seems to me that it is *incredibly* expensive to do the FPU
save and restore, considering what *else* I could be doing with
those clock cycles.

With an average instruction time of 6 clocks (erring on the side of
caution), the question is "can we perform the logic for the
avoidance in 64 or fewer instructions?"

I think the answer is "yes", even if we throw in half a dozen
uncached memory references to main memory as part of the process
and take the 16:1 hit on each of them (that would be 96 clocks in
memory references, leaving us 288/6 = 48 instructions to massage
whatever we got back from those references).

Am I totally off base here?

If we assume L1 for all spill/restore, and uncached main memory
references for checks, we are even *better off* doing the checks
when the ratio drops to 8:1 (e.g. a 1GHz processor with 133MHz
memory), right?

-- Terry
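
For anyone who wants to check the arithmetic, here is a minimal sketch of the cost model used in this message. It is only a back-of-the-envelope calculation, not a measurement. The function and constant names are made up for illustration, and the constants are the assumptions stated above: 512 bytes of state, a 64-bit data path, 3 clocks per store, 6 clocks per instruction, and one CPU:memory ratio's worth of clocks per uncached reference.

```python
# Back-of-the-envelope model of the FPU save/restore figures in this message.
# All names are illustrative; the constants are the message's own assumptions.

FPU_STATE_BYTES = 512       # SSE save area size (quoted with a "?" by Bruce)
BUS_WIDTH_BYTES = 64 // 8   # 64-bit data path
CLOCKS_PER_STORE = 3        # assumed cost of one store in a loop
CLOCKS_PER_INSN = 6         # deliberately pessimistic average


def one_way_clocks(ratio):
    """Clocks to move the FPU state one way at a given CPU:memory ratio.

    ratio=1 models an L1-resident transfer; ratio=16 models a 2GHz CPU
    going all the way out to 133MHz main memory.
    """
    transfers = FPU_STATE_BYTES // BUS_WIDTH_BYTES
    return CLOCKS_PER_STORE * transfers * ratio


def avoidance_budget(ratio, uncached_refs=6):
    """Instructions the avoidance logic may use before it costs more than
    a best-case (L1) save plus restore, after paying for uncached checks."""
    l1_round_trip = 2 * one_way_clocks(1)
    ref_clocks = uncached_refs * ratio
    return (l1_round_trip - ref_clocks) // CLOCKS_PER_INSN


if __name__ == "__main__":
    print(one_way_clocks(16))       # 3072: spill to main memory, one way
    print(2 * one_way_clocks(1))    # 384: save + restore, all in L1
    print(avoidance_budget(16, 0))  # 64: budget with no uncached references
    print(avoidance_budget(16))     # 48: budget at 16:1 with six references
    print(avoidance_budget(8))      # 56: budget at 8:1 with six references
```

The budget grows as the ratio drops because the spill/restore cost is pinned at the L1 figure while the uncached checks get cheaper.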