Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 18 Jun 2009 20:29:31 GMT
From:      Jay Patrick Howard <jhoward@alumni.utexas.net>
To:        freebsd-gnats-submit@FreeBSD.org
Subject:   kern/135718: [PATCH] enhance qsort to properly handle 32-bit aligned data on 64-bit systems
Message-ID:  <200906182029.n5IKTVef007120@www.freebsd.org>
Resent-Message-ID: <200906182030.n5IKU4HE065187@freefall.freebsd.org>

next in thread | raw e-mail | index | archive | help

>Number:         135718
>Category:       kern
>Synopsis:       [PATCH] enhance qsort to properly handle 32-bit aligned data on 64-bit systems
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Jun 18 20:30:04 UTC 2009
>Closed-Date:
>Last-Modified:
>Originator:     Jay Patrick Howard
>Release:        n/a
>Organization:
>Environment:
>Description:
The stdlib qsort() code in FreeBSD is largely taken from the paper "Engineering a Sort Function" by Bentley & McIlroy (1993).

In that paper, the authors employ a clever a scheme for selecting a method of swapping elements.  Basically it goes like this:

1. If the base pointer or element size is not aligned to sizeof(long) then swap byte-by-byte, else
2. If the element size is exactly sizeof(long) then perform a single inline swap, else
3. perform a long-by-long swap.

The implicit assumption here is that if the base pointer or element size isn't aligned to sizeof(long) then one can't do any better than a char-by-char swap.

While this seems to be true on most 32-bit systems, it is not the case on at least some 64-bit systems.  In particular, x86-64.

Consequently, sorting data that is 32-bit aligned (but not 64-bit aligned) is much slower on 64-bit systems compared to 32-bit systems.  This is because in 32-bit mode the qsort() logic uses a long-by-long swap (since the data is aligned) while in 64-bit mode qsort() drops down to a char-by-char swap.

It is true that most workloads on 64-bit systems will be 64-bit aligned.  However, it is fairly common for 64-bit code to need to process binary data that wasn't generated on a 64-bit system and hence may not be 64-bit aligned.  Because this is fairly common it could be a decent "win" for qsort() to support fast swapping for such 32-bit aligned workloads.

In my testing, sorting 64 MB worth of 100-byte records (each with a 12-byte key) took 1.8x as long on a 64-bit system as it did on a 32-bit system, with identical hardware.  With a patched qsort the performance was identical on both 32-bit and 64-bit versions of the code.

The patch is written such that if sizeof(long) == sizeof(int) then it acts exactly like the current version.  The additional swap behavior is only enabled when sizeof(long) > sizeof(int).

The extra overhead from the sizeof(int) alignment check was negligible.  Given the way SWAPINIT is structured, there is no additional overhead whatsoever when the data is 64-bit aligned.  The only time additional overhead is incurred is when data is NOT 64-bit aligned, in which case the extra alignment check quite is likely to provide a significant speedup.
>How-To-Repeat:
Sort records of size 8*n+4 bytes from within 64-bit code.  Then sort the same data from within 32-bit code.  The 64-bit version should take approximately 1.8x as long to execute.
>Fix:
See attached patch modifying /src/lib/libc/stdlib/qsort.c

Patch attached with submission follows:

--- qsort.c	2009-06-18 13:32:58.000000000 -0500
+++ qsort.c.patched	2009-06-18 14:22:02.000000000 -0500
@@ -34,6 +34,7 @@
 __FBSDID("$FreeBSD: src/lib/libc/stdlib/qsort.c,v 1.15 2008/01/14 09:21:34 das Exp $");
 
 #include <stdlib.h>
+#include <limits.h>
 
 #ifdef I_AM_QSORT_R
 typedef int		 cmp_t(void *, const void *, const void *);
@@ -59,8 +60,15 @@
         } while (--i > 0);				\
 }
 
-#define SWAPINIT(a, es) swaptype = ((char *)a - (char *)0) % sizeof(long) || \
+#if LONG_BIT > WORD_BIT
+#define SWAPINIT(a, es) swaptype = ((char *)a - (char *)0) % sizeof(long) ||	\
+	es % sizeof(long) ? ((char *)a - (char *)0) % sizeof(int) || es %	\
+	sizeof(int) ? 3 : 2 : es == sizeof(long)? 0 : 1;
+#else
+#define SWAPINIT(a, es) swaptype = ((char *)a - (char *)0) % sizeof(long) ||	\
 	es % sizeof(long) ? 2 : es == sizeof(long)? 0 : 1;
+#endif
+
 
 static inline void
 swapfunc(a, b, n, swaptype)
@@ -69,6 +77,10 @@
 {
 	if(swaptype <= 1)
 		swapcode(long, a, b, n)
+#if LONG_BIT > WORD_BIT
+	else if(swaptype <= 2)
+		swapcode(int, a, b, n)
+#endif
 	else
 		swapcode(char, a, b, n)
 }


>Release-Note:
>Audit-Trail:
>Unformatted:



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200906182029.n5IKTVef007120>