Date: Thu, 18 Jun 2009 20:29:31 GMT From: Jay Patrick Howard <jhoward@alumni.utexas.net> To: freebsd-gnats-submit@FreeBSD.org Subject: kern/135718: [PATCH] enhance qsort to properly handle 32-bit aligned data on 64-bit systems Message-ID: <200906182029.n5IKTVef007120@www.freebsd.org> Resent-Message-ID: <200906182030.n5IKU4HE065187@freefall.freebsd.org>
next in thread | raw e-mail | index | archive | help
>Number: 135718 >Category: kern >Synopsis: [PATCH] enhance qsort to properly handle 32-bit aligned data on 64-bit systems >Confidential: no >Severity: serious >Priority: low >Responsible: freebsd-bugs >State: open >Quarter: >Keywords: >Date-Required: >Class: sw-bug >Submitter-Id: current-users >Arrival-Date: Thu Jun 18 20:30:04 UTC 2009 >Closed-Date: >Last-Modified: >Originator: Jay Patrick Howard >Release: n/a >Organization: >Environment: >Description: The stdlib qsort() code in FreeBSD is largely taken from the paper "Engineering a Sort Function" by Bentley & McIlroy (1993). In that paper, the authors employ a clever a scheme for selecting a method of swapping elements. Basically it goes like this: 1. If the base pointer or element size is not aligned to sizeof(long) then swap byte-by-byte, else 2. If the element size is exactly sizeof(long) then perform a single inline swap, else 3. perform a long-by-long swap. The implicit assumption here is that if the base pointer or element size isn't aligned to sizeof(long) then one can't do any better than a char-by-char swap. While this seems to be true on most 32-bit systems, it is not the case on at least some 64-bit systems. In particular, x86-64. Consequently, sorting data that is 32-bit aligned (but not 64-bit aligned) is much slower on 64-bit systems compared to 32-bit systems. This is because in 32-bit mode the qsort() logic uses a long-by-long swap (since the data is aligned) while in 64-bit mode qsort() drops down to a char-by-char swap. It is true that most workloads on 64-bit systems will be 64-bit aligned. However, it is fairly common for 64-bit code to need to process binary data that wasn't generated on a 64-bit system and hence may not be 64-bit aligned. Because this is fairly common it could be a decent "win" for qsort() to support fast swapping for such 32-bit aligned workloads. In my testing, sorting 64 MB worth of 100-byte records (each with a 12-byte key) took 1.8x as long on a 64-bit system as it did on a 32-bit system, with identical hardware. With a patched qsort the performance was identical on both 32-bit and 64-bit versions of the code. The patch is written such that if sizeof(long) == sizeof(int) then it acts exactly like the current version. The additional swap behavior is only enabled when sizeof(long) > sizeof(int). The extra overhead from the sizeof(int) alignment check was negligible. Given the way SWAPINIT is structured, there is no additional overhead whatsoever when the data is 64-bit aligned. The only time additional overhead is incurred is when data is NOT 64-bit aligned, in which case the extra alignment check quite is likely to provide a significant speedup. >How-To-Repeat: Sort records of size 8*n+4 bytes from within 64-bit code. Then sort the same data from within 32-bit code. The 64-bit version should take approximately 1.8x as long to execute. >Fix: See attached patch modifying /src/lib/libc/stdlib/qsort.c Patch attached with submission follows: --- qsort.c 2009-06-18 13:32:58.000000000 -0500 +++ qsort.c.patched 2009-06-18 14:22:02.000000000 -0500 @@ -34,6 +34,7 @@ __FBSDID("$FreeBSD: src/lib/libc/stdlib/qsort.c,v 1.15 2008/01/14 09:21:34 das Exp $"); #include <stdlib.h> +#include <limits.h> #ifdef I_AM_QSORT_R typedef int cmp_t(void *, const void *, const void *); @@ -59,8 +60,15 @@ } while (--i > 0); \ } -#define SWAPINIT(a, es) swaptype = ((char *)a - (char *)0) % sizeof(long) || \ +#if LONG_BIT > WORD_BIT +#define SWAPINIT(a, es) swaptype = ((char *)a - (char *)0) % sizeof(long) || \ + es % sizeof(long) ? ((char *)a - (char *)0) % sizeof(int) || es % \ + sizeof(int) ? 3 : 2 : es == sizeof(long)? 0 : 1; +#else +#define SWAPINIT(a, es) swaptype = ((char *)a - (char *)0) % sizeof(long) || \ es % sizeof(long) ? 2 : es == sizeof(long)? 0 : 1; +#endif + static inline void swapfunc(a, b, n, swaptype) @@ -69,6 +77,10 @@ { if(swaptype <= 1) swapcode(long, a, b, n) +#if LONG_BIT > WORD_BIT + else if(swaptype <= 2) + swapcode(int, a, b, n) +#endif else swapcode(char, a, b, n) } >Release-Note: >Audit-Trail: >Unformatted:
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200906182029.n5IKTVef007120>