From owner-freebsd-hackers@FreeBSD.ORG Fri Oct 31 06:13:56 2003 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id E236216A4CE for ; Fri, 31 Oct 2003 06:13:56 -0800 (PST) Received: from proto.pvv.ntnu.no (proto.pvv.ntnu.no [129.241.210.168]) by mx1.FreeBSD.org (Postfix) with SMTP id 428E043FBD for ; Fri, 31 Oct 2003 06:13:55 -0800 (PST) (envelope-from markusk@pvv.ntnu.no) Received: (qmail 67616 invoked by uid 28724); 31 Oct 2003 14:13:53 -0000 Date: Fri, 31 Oct 2003 15:13:53 +0100 (CET) From: =?ISO-8859-1?Q?Markus_B_Kr=FCger?= X-X-Sender: markusk@proto.pvv.ntnu.no To: freebsd-hackers@freebsd.org Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=ISO-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE Subject: Suggested improvement to radixsort() X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 31 Oct 2003 14:13:57 -0000 (I hope this is the correct forum for this post; if not, please let me know where it should go.) I've found a possible improvement to the implementation of radixsort() located in /usr/src/lib/libc/stdlib/radixsort.c, which will make the implementation more efficient when sorting data sets containing many strings that share a common prefix. The improvement consists in checking for bins where all strings have the same character at the current position of the radix sort, and moving on to the next character instead of needlessly shuffling the strings in the bin around. When I used the modified radixsort() to sort a web log file, which typically contains many strings with common prefixes (IP addresses), I registered a speedup of approximately 15%. On a constructed dataset where all strings had a long common prefix, the speedup was 50%. My modifications are listed below in patch format. Corrections, comments and questions are welcome. Is this patch something that I should send with send-pr, or should it be submitted elsewhere? --- begin radixsort.c.patch --- *** radixsort.c=09Tue Mar 11 12:39:57 1997 --- radixsort.c.patch=09Fri Oct 31 12:55:10 2003 *************** *** 175,180 **** --- 175,191 ---- =09=09} =09=09/* + =09=09 * Special case: if all strings have the same + =09=09 * character at position i, move on to the next + =09=09 * character. + =09=09 */ + =09=09if (nc =3D=3D 1 && count[bmin] =3D=3D n) { + =09=09=09push(a, n, i+1); + =09=09=09nc =3D count[bmin] =3D 0; + =09=09=09continue; + =09=09} + + =09=09/* =09=09 * Set top[]; push incompletely sorted bins onto stack. =09=09 * top[] =3D pointers to last out-of-place element in bins. =09=09 * count[] =3D counts of elements in bins. --- end radixsort.c.patch --- --=20 ,------------------- Markus Bjartveit Kr=FCger ---------------------. ' ` ` E-mail: markusk@pvv.org WWW: http://www.pvv.org/~markusk/ ' )-------------------------------------------------------------------(