Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 29 Jun 2008 10:36:29 +0200
From:      Christoph Mallon <christoph.mallon@gmx.de>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        cvs-src@freebsd.org, src-committers@freebsd.org, cvs-all@freebsd.org, Marius Strobl <marius@alchemy.franken.de>
Subject:   Re: cvs commit: src/sys/sparc64/include in_cksum.h
Message-ID:  <4867498D.5050409@gmx.de>
In-Reply-To: <20080629121025.K92490@delplex.bde.org>
References:  <200806252105.m5PL5AUp064418@repoman.freebsd.org>	<48654667.1040401@gmx.de>	<20080627222404.GJ1215@alchemy.franken.de> <48657058.6020102@gmx.de>	<20080628121025.F89039@delplex.bde.org> <4865F89D.4090207@gmx.de> <20080629121025.K92490@delplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help
Bruce Evans wrote:
>>> Right.  Though I've never seen unnecessary's __volatiles significantly
>>> affecting i386 code.  This is because the code in the asms can't be
>>> removed completely, and can't be moved much either.  With out of order
>>> execution, the type of moves that are permitted (not across 
>>> dependencies)
>               
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> are precisely the type of moves that the CPU's scheduler can do or undo
>>> no matter how the compiler orders the code.
>>
>> I disagree. For example look at the use of in_addword() in 
>> dev/sk/if_sk.cv in line 2819:
>>  csum1 = htons(csum & 0xffff);
>>  csum2 = htons((csum >> 16) & 0xffff);
>>  ipcsum = in_addword(csum1, ~csum2 & 0xffff);
>>  /* checksum fixup for IP options */
>>  len = hlen - sizeof(struct ip);
>>  if (len > 0) {
>>    return;
>>  }
>>
>> The calculation will be executed even if the following if (len > 0) 
>> leaves the function and the value of ipcsum is unused.
>> If in_addword() is not marked volatile it can be moved after the if 
>> and not be executed in all cases. csum1 and csum2 can be moved after 
>> the if, too.
> 
> No, volatile has no effect on whether the above calculation will be
> executed, since the early return has no dependencies on the caclulation.

The volatile induces a dependency.

> Old versions of gcc used to handle volatile like that, but this changed
> in gcc-3 or earlier.  gcc.info now says:
> 
> % The `volatile' keyword indicates that the instruction has important
> % side-effects.  GCC will not delete a volatile `asm' if it is reachable.
>                                                       ^^^^^^^^^^^^^^^^^^^

This is not about whether the code is reachable or not (it is 
reachable), it is about whether the result is used (i.e. whether the 
code is dead).

> % (The instruction can still be deleted if GCC can prove that
> % control-flow will never reach the location of the instruction.)  Note
> % that even a volatile `asm' instruction can be moved relative to other
>        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> % code, including across jump instructions.  For example, on many targets
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

jump != conditional jump.
If it moves the *volatile* asm statement across the if, it would only 
appear on *some* execution paths, which is wrong. It is perfectly fine 
to move it, when the statement is not volatile, though.

> Even if gcc didn't move the caclulation, then CPUs with out of order
> execution might schedule it so that it is effectively never executed
> (most likely by executing it in otherwise-unused pipelines while the
> main pipeline returns).  This is valid for the same reasons that gcc
> can move the volatile asms -- the return doesn't depend on the result
> of the caclulation.

This is for the CPU to decide. If the assembler block really contains 
"important" stuff like memory barriers, writes to machine control 
registers etc., the CPU will not "effectively never execute" the code. 
The compiler does not know this, all it sees is the word "volatile".

> The above C code is fairly bad, but generates not so bad code on i386:
> 
> % %     movl    %esi, %eax
> % #APP
> %     xchgb %ah, %al        # byte operations can be slow; this one not
>                 # too bad, but I wonder if rorw $8 is better
>                 # (rorl $16 is already used for corresponding
>                 # 32-bit operations) where there is no xchg
>                 # alternative
> % #NO_APP

And this again is an example why not to use inline assembler, but let 
the compiler decide this:

unsigned short swap16(unsigned short x)
{
   return x >> 8 | x << 8;
}

is compiled to

swap16:
         movl    4(%esp), %eax
         rolw    $8, %ax
         ret

The compiler is even able to do optimisations, which it absolutely 
cannot do, if inline assembler is used, example:

unsigned short id(unsigned short x)
{
   return swap16(swap16(x));
}

results in

id:
         movl    4(%esp), %eax
         ret

Maybe the MD htons() macros should be replaced by MI code.

> %     shrl    $16, %esi
> %     movl    %esi, %edx
> % #APP
> %     xchgb %dh, %dl        # as above
> % #NO_APP
> %     notl    %edx        # poor asm code -- the top 16 bits are unused
>                 # except here to stall for merging them with
>                 # the previous byte operation

The compiler simply does not know, that the inline assembler only 
operates on parts of the register. Another reason not to use inline 
assembler:

u_int g(u_short x)
{
   return ~swap16(x);
}

g:
         movl    4(%esp), %eax
         rolw    $8, %ax
         movzwl  %ax, %eax         # avoid stall
         notl    %eax
         ret


Regards
	Christoph



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4867498D.5050409>