Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 19 Aug 1999 22:28:50 +0800
From:      Peter Wemm <peter@netplex.com.au>
To:        Bruce Evans <bde@zeta.org.au>
Cc:        cvs-all@FreeBSD.org, cvs-committers@FreeBSD.org
Subject:   Re: cvs commit: src/sys/i386/include cpufunc.h 
Message-ID:  <19990819142850.C78D81C9F@overcee.netplex.com.au>
In-Reply-To: Your message of "Thu, 19 Aug 1999 17:27:33 %2B1000." <199908190727.RAA14796@godzilla.zeta.org.au> 

next in thread | previous in thread | raw e-mail | index | archive | help
Bruce Evans wrote:
> >  Modified files:
> >    sys/i386/include     cpufunc.h 
> >  Log:
> >  Try using the builtin ffs() for egcs, it (by random inspection)
> >  generates slightly better code and avoids the incl then subl when
> >  using ffs(foo) - 1.
> 
> The inline asm version of ffs(x) should be implemented as
> (x == 0 ? 0 : bsfl(x) + 1).  The compiler can then perform all possible
> optimisations except ones that use the condition codes delivered by bsfl
> (these never seem to help).  This gives slightly better code than the
> builtin.

except the one where I want "bsfl(x)", not "bsfl(x) + 1",  With the cpufunc.h
inline, it works out as:

testl %eax,%eax; je 1f; bsfl %eax; addl $1,%eax; 1:
subl $1,%eax

ie: it can't optimize out the +1 -1.

How about this instead:

static __inline int
__bsfl(int mask)
{
        int result;

        __asm __volatile("bsfl %0,%0" : "=r" (result) : "0" (mask));
        return result;
}

static __inline int
ffs(int mask)
{
        return mask == 0 ? mask : __bsfl(mask) + 1;
}

Then, with the following code:

extern int bar(int);
int foo(int j)
{
        int i;

	if (j)
	        bar (ffs(j) - 1);
}

It gets optimized much better:

foo:
        movl 4(%esp),%eax
        testl %eax,%eax
        je .L6
#APP
        bsfl %eax,%eax
#NO_APP
        pushl %eax
        call bar
        addl $4,%esp
.L6:
        ret

Versus the original inline with your ffs macro:

foo:
        movl 4(%esp),%eax
        testl %eax,%eax
        je .L37
#APP
        testl %eax,%eax
	je 1f
	bsfl %eax,%eax
	incl %eax
1:
#NO_APP
        decl %eax
        pushl %eax
        call bar
        addl $4,%esp
.L37:
        ret

The redundant incl, decl isn't optimized and contains a duplicate (never
taken) test.

Using this same code with builtin_ffs() results in:
foo:
        movl 4(%esp),%eax
        testl %eax,%eax
        je .L37
        bsfl %eax,%eax
        pushl %eax
        call bar
        addl $4,%esp
.L37:
        ret

However, you're right. builtin_ffs() sucks when the argument is not known
ie: leave out the if (j), and it turns into:
foo:
        movl 4(%esp),%eax
        bsfl %eax,%edx
        jne .L37
        movl $-1,%edx
.L37:
        pushl %edx
        call bar
        addl $4,%esp
        ret

Which means bsfl is always called even for a zero arg.

Leaving out "if (j)" on my above code with my __bsfl() version results in:
foo:
        movl 4(%esp),%eax
        testl %eax,%eax
        je .L6
#APP
        bsfl %eax,%eax
#NO_APP
        incl %eax
.L6:
        decl %eax
        pushl %eax
        call bar
        addl $4,%esp
        ret
.. which is equivalent to your inline.  In all cases I've checked, the code
is either equivalent or better.  (ie: addl, subl optimized out)

Having ffs() be base 1 is a pest.  ffs0() (base 0) would be damn convenient
at times, considering the number of places 'ffs(foo) - 1' turns up.

> Bruce

Cheers,
-Peter



To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe cvs-all" in the body of the message




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?19990819142850.C78D81C9F>