From: Shill <shill@free.fr>
To: freebsd-questions@freebsd.org
Date: Sun, 25 May 2003 12:02:56 +0200
Subject: Re: ELF .data section variables and RWX bits
Message-ID: <3ED094D0.2040303@example.com>

>> _start:
>>     rdtsc
>>     mov ebp, eax
>>     xor eax, eax
>>     cpuid
>>     ; BEGIN TIMED CODE
>>
>>     ; END TIMED CODE
>>     xor eax, eax
>>     cpuid
>>     rdtsc
>>     sub ebp, eax
>>     xor eax, eax
>>     cpuid
>>     neg ebp
>>
>> Note: CPUID is used only as a serializing instruction.
>
> You write that like you know what it means, but I'm not convinced
> that you do.

It is, indeed, possible that I have misunderstood the concept.

IA-32 Software Developer's Manual, Volume 3: System Programming Guide
Section 7.4

The IA-32 architecture defines several serializing instructions.
These instructions force the processor to complete all modifications
to flags, registers, and memory by previous instructions and to drain
all buffered writes to memory before the next instruction is fetched
and executed. When the processor serializes instruction execution, it
ensures that all pending memory transactions are completed, including
writes stored in its store buffer, before it executes the next
instruction. Nothing can pass a serializing instruction, and
serializing instructions cannot pass any other instruction (read,
write, instruction fetch, or I/O).

The CPUID instruction can be executed at any privilege level to
serialize instruction execution with no effect on program flow, except
that the EAX, EBX, ECX, and EDX registers are modified.

The following additional information is worth noting regarding
serializing instructions: the processor does not write back the
contents of modified data in its data cache to external memory when it
serializes instruction execution. Software can force modified data to
be written back by executing the WBINVD instruction, which is a
serializing instruction.

I set EAX to 0 before I call CPUID because different CPUID functions
have different latencies. I use CPUID to ensure that code from the
timed function is not mixed with code from the stub. In fact, my stub
is so short that if I removed every CPUID, the outcome would probably
be similar.

>> I wanted to time the latency of a store,
>
> You have timed your programs, but I don't know what you mean
> here by "latency".

You can replace "latency" with "number of cycles required to
complete". As you already know, in a pipelined processor, one can look
at the latency of a single instruction, which is somewhat related to
the depth of the pipeline.

>> I timed three different programs:
>> P1) mov ebx, [X]              ; load, i.e. read only
>> P2) mov dword [X], 0xaabbccdd ; store, i.e. write only
>> P3) add dword [X], byte 0x4C  ; load/execute/store, i.e. read+write
>>
>> P1 requires 170 cycles.
>> P2 requires 12000 cycles on average (MIN=10000 and MAX=46000)
>> P3 requires 22500 cycles on average (MIN=14500 and MAX=72000)
>>
>> A cache miss might explain why P1 requires 170 cycles, but it does
>> not explain P2 or P3, as far as I know.
>>
>> My guess is that the first time X is written to, an exception
>> occurs (perhaps a TLB miss) and the operating system (FreeBSD in my
>> case) is required to update something, somewhere.
>>
>> Could it be that FreeBSD does not set the write bit for the page
>> where X is stored until X is *actually* written to? But then why
>> would P3 take much longer than P2?
>
> 1. Ideally your data segment should be aligned on a 4096-byte
>    boundary; this is the normal page size for IA32 (32-bit 386, 486,
>    Pentium, etc.) memory.

I will try to fiddle around with the alignment constraints.

> 2. The big delays in the program that does the write are probably
>    because the OS loads the program data into "copy on write" pages.
>    (This is generally favoured where the same page of a program
>    contains both data and code, or where the same program image may
>    be shared by multiple concurrent instances. Googling for
>    "copy on write" together with FreeBSD returns lots of matches.)
>    These indeed cause the OS to intervene on a write fault, creating
>    a new page with the same content before resuming the write
>    operation.

Very interesting. I will look into "copy on write". I think you've
nailed part of the answer.

> To get a reasonable time measurement you should:
>
> a. Write to the data address before running your test; this moves
>    the copy-on-write operation out of your measurement.
>
> b. (optional) Put a loop around your test code that runs the test
>    multiple times (rather than running the program lots of times).
>    This will probably give faster times, as your test code will
>    likely be cached by the CPU the first time through. (The exact
>    implementation of your loop, timing of interrupts, etc. will
>    affect this somewhat.)
I was not interested in the measurement per se, which is why accuracy
is not a concern to me. Rather, I was intrigued by the enormous
difference between reads and writes.

> 3. On Pentium & Athlon processor families it is not very appropriate
>    to time single instructions. These processors break instructions
>    down into RISC-style mini-instructions. They also do dynamic
>    analysis of which instructions depend on each other, e.g. for
>    register contents being valid. They then try to execute multiple
>    RISC mini-instructions, corresponding to more than one x86
>    instruction, in each clock cycle.
>
>    Instructions may even be executed in a different order (which
>    requires very clever work to undo when an interrupt is received
>    or a trap generated). Serializing instructions, such as CPUID,
>    force the CPU to complete out-of-order execution and only then
>    execute that instruction.

I do know a bit about superscalar cores and speculative execution. As
I've stated above, accuracy was not a concern.

>    This means the time taken to execute an instruction is
>    substantially affected by where it is placed in the program.
>    Conversely, an extra instruction placed in a program could either
>    be executed with no extra cycles (it "pairs" with another
>    instruction) or add many extra cycles.

Very true. Actually, the Athlon's integer pipeline has a width of 3,
which, I suppose, is why you've placed "pairs" between quotes.

> 4. Read/modify/write instructions are compact and work well on the
>    8086 but not so well on the i486 and up. Intel suggests using a
>    register-based programming model: load register, modify register,
>    store register. These instructions can usually be scheduled by a
>    compiler with other instructions for maximum speed. (Sorry, I
>    can't find a reference for this just now.)

Perhaps you're thinking about the Pentium 4?
I would not expect Intel to offer optimization guidelines concerning
the K7 microarchitecture ;)

AMD Athlon Processor x86 Code Optimization Guide
Chapter 4 - Instruction Decoding Optimizations

Use Read-Modify-Write Instructions Where Appropriate

The AMD Athlon processor handles read-modify-write (RMW) instructions
such as "ADD [mem], reg32" very efficiently. The vast majority of RMW
instructions are DirectPath instructions. Use of RMW instructions can
provide a performance benefit over the use of an equivalent
combination of load, load-execute and store instructions. In
comparison to the load/load-execute/store combination, the equivalent
RMW instruction promotes code density (better I-cache utilization),
preserves decode bandwidth, and saves execution resources as it
occupies only one reservation station and requires only one address
computation. It may also reduce register pressure, as demonstrated in
Example 2.

Use of RMW instructions is indicated if an operation is performed on
data that is in memory, and the result of that operation is not
reused soon. Due to the limited number of integer registers in an x86
processor, it is often the case that data needs to be kept in memory
instead of in registers. Additionally, it can be the case that the
data, once operated upon, is not reused soon. An example would be an
accumulator inside a loop of unknown trip count, where the accumulator
result is not reused inside the loop. Note that for loops with a known
trip count, the accumulator manipulation can frequently be hoisted out
of the loop.

On a related note, the use of load-execute instructions is one of the
top optimizations for the Athlon.

Use Load-Execute Integer Instructions

Most load-execute integer instructions are DirectPath decodable and
can be decoded at the rate of three per cycle.
Splitting a load-execute integer instruction into two separate
instructions - a load instruction and a "reg, reg" instruction -
reduces decoding bandwidth and increases register pressure, which
results in lower performance. Use the split-instruction form to avoid
scheduler stalls for longer-executing instructions and to explicitly
schedule the load and execute operations.

I'm still not sure how to explain the difference between P2 and P3,
i.e. a simple write versus a write after read... Anyway...

Thank you very much for your post. I'm off to read up on
"copy on write".

Shill