From owner-freebsd-numerics@FreeBSD.ORG Wed May 22 05:03:04 2013 Return-Path: Delivered-To: numerics@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 29C12748; Wed, 22 May 2013 05:03:04 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from fallbackmx08.syd.optusnet.com.au (fallbackmx08.syd.optusnet.com.au [211.29.132.10]) by mx1.freebsd.org (Postfix) with ESMTP id A710888A; Wed, 22 May 2013 05:03:02 +0000 (UTC) Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au [211.29.132.184]) by fallbackmx08.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id r4M4tUj8008996; Wed, 22 May 2013 14:55:30 +1000 Received: from c122-106-156-23.carlnfd1.nsw.optusnet.com.au (c122-106-156-23.carlnfd1.nsw.optusnet.com.au [122.106.156.23]) by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id r4M4tJJn031018 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 22 May 2013 14:55:21 +1000 Date: Wed, 22 May 2013 14:55:19 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: numerics@freebsd.org Subject: extra-precision bugs in clang on i386 even with __SSE*_MATH__ Message-ID: <20130522131618.M1038@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.0 cv=BPvrNysG c=1 sm=1 a=OBzK9IdsZ0wA:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=SnWk4Hs31W4A:10 a=7jRFLMzktCi4NJO2apkA:9 a=CjuIK1q_8ugA:10 a=ebeQFi2P/qHVC0Yw9JDJ4g==:117 Cc: dim@freebsd.org X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 22 May 2013 05:03:04 -0000 clang with certain march= in CFLAGS uses SSE for double and/or float operations even on i386 although the ABI doesn't really allow this, and sets __SSE2_MATH__ and/or __SSE_MATH__ to indicate this. It is well known that this breaks the definitions of float_t, double_t and FLT_EVAL_METHOD, because FreeBSD headers haven't been updated to support clang; in particular they know nothing of __SSE*_MATH__. I recently noticed a related bug that is not due to missing support in FreeBSD. Like gcc, clang doesn't discard extra precision on casts or assignments, as is fuzzily specified in C90 and explicitly specified in C99). Using SSE breaks extra precision by preventing it from occurring in most cases, but it can still occur if a double or float is somehow allocated in an i387 register. This occurs mainly for function calls, because the ABI requires FP functions to return results in extra-precision registers, and it is normal and normally good to actually return extra-precision results. clang then fails to clip the extra precision even when explicitly requested to do so using a cast or assignment. E.g. @ #include @ #include @ @ static double tiny = 0x1p-31; @ @ int @ main(void) @ { @ printf("tiny = %a\n", tiny); @ printf("cos(tiny) may have extra precision, and is %La\n", @ (long double)cos(tiny)); @ printf("(double)cos(tiny) should be 1, but is %La\n", @ (long double)(double)cos(tiny)); @ return (0); @ } This works correctly on amd64 (amd64 happens not to use i386 fcos, but even if it did then the extra precision in the fcos result would be clipped by the ABI returning the result in an xmm regiser), but on i386 it gives: @ tiny = 0x1p-31 @ cos(tiny) may have extra precision, and is 0x1.fffffffffffffffcp-1 @ (double)cos(tiny) should be 1, but is 0x1.fffffffffffffffcp-1 Both results are just (1 - (long double)tiny * tiny / 2) rounded to long double precision. It is normally good for cos(tiny) to have extra precision, so that you can use it in double precision expressions that produce extra-precision results according to FLT_EVAL_METHOD. Without this, you can't write even the squaring operation x*x as a function that returns x*x, without the function being unnecessarily less accurate and giving different results than the expression x*x or a function-like macro that gives x*x. Similarly for all functions that return double or float. However, sometimes you need to clip the extra precision, and compiler bugs make this difficult to do in a standard way. Normally the compiler bugs are features because the clipping operation is slow, and naive code that assigns intermediate results to double and float variables will lose precision and speed unnecessarily (non-naive code must use float_t and double_t for almost everything and clip these only when necessary. This works well except across function calls). FreeBSD libm uses the STRICT_ASSIGN() macro to do clipping. It only does explicit clipping if FLT_EVAL_METHOD != 0 (or !__GNUC__). The present bug prevents fixing FLT_EVAL_METHOD to be 0 with clang using SSE*, since that would break STRICT_ASSIGN() in the case where the rvalue has extra precision because it is allocated in a register. It is difficult to make STRICT_ASSIGN() depend on the register allocation even in a nonstandard way. C11 breaks this area even more. It specifies that extra precision is always clipped on return, at least in C functions. This breaks intentionally returning extra precision for accuracy, and more seriously it breaks efficiency by requiring the slow clipping operation on every return (on x86, the clipping operation takes about as long as 2 serially-dependent addition operations and stalls pipelines due to its serial dependencies). Buggy compilers should ignore this buggy specification like they already ignore the buggy specification to clip precision on casts and assignments. Clipping on every assignment is even more inefficient because assignments are more common than returns. However, there needs to be a standard way to clip (just casting?). I don't know if this applies to library functions implemented in another language. If it does, then all libraries that use raw hardware functions like i387 fcos are non-conformant, and must be weakened for conformance. C99's bugs in this area give the bizarre requirement that extra precision must be clipped on return if and only if it was gained intentionally (literally, if and only if it is explicit, but implementations that generate it intentionally will probably have it explicitly). The C11 regression is a apparently a reaction to this bizarreness. Examples: 1. double sq(double x) { return (x * x); } 2. double sq(double x) { return ((long double)x * x); } 3. double sq(double x) { return ((double_t)x * x); } (1) may evaluate x*x in implicit extra precision according FLT_EVAL_METHOD. If it did, then it should return the extra precision. C99 permits this, but C11 doesn't. (2) evaluates x*x in explicit higher precision (which is strictly higher iff long double is strictly larger). Neither C99 nor C11 permit returning the extra precision, if any. (3) evaluates x*x in explicit minimally-for-efficiency extra precision. Neither C99 nor C11 permit returning the extra precision, if any. So the most careful code that uses double_t to keep as much extra precision as possible (subject to efficiency) is required to waste time to kill this care at function call boundaries. I use a fake null cast of double_t to double (using null asm) to avoid the losses in (3). This works with gcc but not with clang. clang seemed to implement part of the C11 breakage and added clipping code even for converting a type to the same type. In numeric terms, for a certain version of logf on certain hardware, the clipping costs were approximately: - gcc version on i386 with no clipping: ~35 cycles/call - clang version on i386 with clipping: ~35+8 cycles/call - gcc and clang versions on amd64: ~35+8 cycles/call amd64 was slower because the ABI enforces clipping. The cycle counts were not precisely the same for the clipping versions on the different arches, but the clipping cost is 8 cycles on both and this dominates all the other differences (using SSE and a cleaner ABI on amd64 makes remarkbly little difference). Investigating the clang difference now showed futher bizarreness: - neither clang nor gcc do the clipping required by C99 for (2) or (3). This is apparently because thet know that upcasting x doesn't change it but don't know that upcasting x may change the result of the expression. - however, both gcc and clang do the clipping for: (3'). double prod(double x, long double y) { return (x * y); } Here is my asm hack that doesn't work for clang: @ /* @ * Convert a float_t to a float without losing extra precision and/or @ * wasting time, if possible. This is a sort of inverse for @ * STRICT_ASSIGN(). It is typically used to avoid C's bizarre return @ * semantics, which require that extra precision intentionally obtained by @ * using float_t within the function must be lost on return, while extra @ * precision accidentally obtained by using float within the function may @ * be kept on return. @ */ @ static inline float @ hackfloat_t(__float_t ft) @ { @ #if defined(__i386__) && !defined(__SSE_MATH__) @ float f; @ @ __asm("" : "=t" (f) : "0" (ft)); @ return (f); @ #else @ return (ft); @ #endif @ } @ @ /* Similarly for double_t. */ @ static inline double @ hackdouble_t(__double_t dt) @ { @ #if defined(__i386__) && !defined(__SSE2_MATH__) @ double d; @ @ __asm("" : "=t" (d) : "0" (dt)); @ return (d); @ #else @ return (dt); @ #endif @ } @ Changing (x * y) to hackdouble_t(x * y) fixes the lost accuracy and speed only for gcc. Bruce