From owner-freebsd-fs@FreeBSD.ORG Tue Dec  2 18:06:42 2003
Message-ID: <3FCD4396.1040100@vicor.com>
Date: Tue, 02 Dec 2003 17:59:50 -0800
From: Ken Marx <kmarx@vicor.com>
To: Don Lewis
In-Reply-To: <200311282135.hASLYveF018257@gw.catspoiler.org>
cc: freebsd-fs@FreeBSD.org
cc: mckusick@beastie.mckusick.com
Subject: Re: 4.8 ffs_dirpref problem

Don Lewis wrote:
> On 28 Nov, To: kmarx@vicor.com wrote:
>
>> On 24 Nov, Ken Marx wrote:
>>
>>> Don Lewis wrote:
>>
>>>> Index: sys/kern/vfs_bio.c
>>>> ===================================================================
>>>> RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
>>>> retrieving revision 1.242.2.21
>>>> diff -u -r1.242.2.21 vfs_bio.c
>>>> --- sys/kern/vfs_bio.c	9 Aug 2003 16:21:19 -0000	1.242.2.21
>>>> +++ sys/kern/vfs_bio.c	18 Nov 2003 02:10:55 -0000
>>>> @@ -140,6 +140,7 @@
>>>>  	    &bufreusecnt, 0, "");
>>>>  
>>>>  static int bufhashmask;
>>>> +static int bufhashshift;
>>>>  static LIST_HEAD(bufhashhdr, buf) *bufhashtbl, invalhash;
>>>>  struct bqueues bufqueues[BUFFER_QUEUES] = { { 0 } };
>>>>  char *buf_wmesg = BUF_WMESG;
>>>> @@ -160,7 +161,20 @@
>>>>  struct bufhashhdr *
>>>>  bufhash(struct vnode *vnp, daddr_t bn)
>>>>  {
>>>> -	return(&bufhashtbl[(((uintptr_t)(vnp) >> 7) + (int)bn) & bufhashmask]);
>>>> +	u_int64_t hashkey64;
>>>> +	int hashkey;
>>>> +
>>>> +	/*
>>>> +	 * Fibonacci hash, see Knuth's
>>>> +	 * _Art of Computer Programming, Volume 3 / Sorting and Searching_
>>>> +	 *
>>>> +	 * We reduce the argument to 32 bits before doing the hash to
>>>> +	 * avoid the need for a slow 64x64 multiply on 32 bit platforms.
>>>> +	 */
>>>> +	hashkey64 = (u_int64_t)(uintptr_t)vnp + (u_int64_t)bn;
>>>> +	hashkey = (((u_int32_t)(hashkey64 + (hashkey64 >> 32)) * 2654435769u) >>
>>>> +	    bufhashshift) & bufhashmask;
>>>> +	return(&bufhashtbl[hashkey]);
>>>>  }
>>>>  
>>>>  /*
>>>> @@ -319,8 +333,9 @@
>>>>  bufhashinit(caddr_t vaddr)
>>>>  {
>>>>  	/* first, make a null hash table */
>>>> +	bufhashshift = 29;
>>>>  	for (bufhashmask = 8; bufhashmask < nbuf / 4; bufhashmask <<= 1)
>>>> -		;
>>>> +		bufhashshift--;
>>>>  	bufhashtbl = (void *)vaddr;
>>>>  	vaddr = vaddr + sizeof(*bufhashtbl) * bufhashmask;
>>>>  	--bufhashmask;
>>>
>>> Well, I'm mildly beflummoxed - I tried to compare hashtable performance
>>> between all three known versions of the hashing - legacy power of 2,
>>> the Vicor ^= hash, and Don's fibonacci hash.
>>>
>>> Running with
>>>
>>>     minifree = max( 1, avgifree / 4 );
>>>     minbfree = max( 1, avgbfree );
>>>
>>> all perform about the same, with no performance problems all
>>> the way up to 100% disk capacity (didn't test into reserved space).
>>>
>>> Looking at instrumentation to show freq and avg depth of the
>>> hash buckets, everything seems very calm (mainly because
>>> we're not hitting the linear searching very often, I'd presume).
>>>
>>> I can't explain why I seemingly got performance problems
>>> with similar (identical) minbfree code previously.
>>>
>>> So, out of spite, I went back to
>>>
>>>     minbfree = max( 1, avgbfree/4 );
>>>
>>> This does hit the hashtable harder for the legacy version
>>> and not so much for either new flavor. Here are a few
>>> samplings of calling my dump routine from the debugger.
>>> "avgdepth" really means 'search depth' since we use
>>> the depth reached after finding a bp in gbincore.
>>>
>>> A line below such as,
>>>
>>>     0: avgdepth[1] cnt=801
>>>
>>> means that 801 of the hashtable buckets had an avg search
>>> depth of 1 at the time the debug routine was called.
>>> The 'N:' prefix means the N-th unique non-zero such value.
>>> So large cnt's for small []'d depth values means an efficient hash.
>>>
>>> I've edited out the details as much as possible.
>>>
>>> LEGACY:
>>> --------
>>> Nov 24 13:34:54 oos0b /kernel: bh[442/0x1ba]: freq=2706110, avgdepth = 154
>>> ...
>>> Nov 24 13:34:54 oos0b /kernel: 0: avgdepth[1] cnt=1015
>>> Nov 24 13:34:54 oos0b /kernel: 1: avgdepth[2] cnt=7
>>> Nov 24 13:34:54 oos0b /kernel: 2: avgdepth[154] cnt=1  <- !!
>>> Nov 24 13:34:54 oos0b /kernel: 3: avgdepth[3] cnt=1
>>>     -----------
>>>
>>> Nov 24 13:36:49 oos0b /kernel: bh[442/0x1ba]: freq=3416953, avgdepth = 141
>>> ...
>>> Nov 24 13:36:49 oos0b /kernel: 0: avgdepth[1] cnt=1017
>>> Nov 24 13:36:49 oos0b /kernel: 1: avgdepth[141] cnt=1
>>> Nov 24 13:36:49 oos0b /kernel: 2: avgdepth[2] cnt=6
>>>
>>> VICOR x-or hashtable:
>>> ---------------------
>>> Nov 24 13:07:24 oos0b /kernel: 0: avgdepth[1] cnt=762
>>> Nov 24 13:07:24 oos0b /kernel: 1: avgdepth[2] cnt=259
>>> Nov 24 13:07:24 oos0b /kernel: 2: avgdepth[3] cnt=3
>>>     -----------
>>>
>>> Nov 24 13:08:07 oos0b /kernel: 0: avgdepth[1] cnt=744
>>> Nov 24 13:08:07 oos0b /kernel: 1: avgdepth[2] cnt=275
>>> Nov 24 13:08:07 oos0b /kernel: 2: avgdepth[3] cnt=5
>>>
>>> FIBONACCI:
>>> ----------
>>> Nov 24 11:56:50 oos0b /kernel: 0: avgdepth[1] cnt=811
>>> Nov 24 11:56:50 oos0b /kernel: 1: avgdepth[3] cnt=88
>>> Nov 24 11:56:50 oos0b /kernel: 2: avgdepth[2] cnt=124
>>> Nov 24 11:56:50 oos0b /kernel: 3: avgdepth[0] cnt=1
>>>     -----------
>>>
>>> Nov 24 11:57:48 oos0b /kernel: 0: avgdepth[1] cnt=801
>>> Nov 24 11:57:48 oos0b /kernel: 1: avgdepth[3] cnt=93
>>> Nov 24 11:57:48 oos0b /kernel: 2: avgdepth[2] cnt=130
>>>
>>> So, while this is far from analytically exhaustive,
>>> it almost appears the fibonacci hash has more entries
>>> of depth 3, while the Vicor one has more at depth 2.
>>>
>>> I'm happy to run more tests if you have ideas. I'm also fine
>>> to cut bait and go with whatever you decide. It *seems* like
>>> putting in the fibonacci hash is prudent, since the current hash
>>> has been observed to be expensive. I had trouble proving this
>>> unequivocally, though. So, perhaps Don's minbfree fix is sufficient
>>> after all. I'm tempted at this point to go with the 100% flavor.
>>
>> I think we're running into one of the weaknesses in the Fibonacci hash.
>> There are a large number of hash entries for the cylinder group blocks,
>> which are located at offsets which are multiples of 89 * 2^10 in your
>> example, or something on the order of 2^16.
>> The effect of this is for
>> the cylinder group number to be hashed using the least significant bits
>> of the hash multiplier, which don't work as well for distributing the
>> hash values. I tried some of Knuth's suggestions, and got better
>> results with the hash multiplier 0x9E376DB1u. The most significant 16
>> bits of the multiplier are the same as the original constant, and the
>> least significant bits act as a fraction in the desirable range of 1/3
>> to 3/7. Please give this new hash multiplier a try.
>
> I went ahead and spun a new version of my patch with the new multiplier,
> one other tweak to the formula, and updated comments.
>
> Index: sys/kern/vfs_bio.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
> retrieving revision 1.242.2.21
> diff -u -r1.242.2.21 vfs_bio.c
> --- sys/kern/vfs_bio.c	9 Aug 2003 16:21:19 -0000	1.242.2.21
> +++ sys/kern/vfs_bio.c	28 Nov 2003 20:02:06 -0000
> @@ -140,6 +140,7 @@
>  	    &bufreusecnt, 0, "");
>  
>  static int bufhashmask;
> +static int bufhashshift;
>  static LIST_HEAD(bufhashhdr, buf) *bufhashtbl, invalhash;
>  struct bqueues bufqueues[BUFFER_QUEUES] = { { 0 } };
>  char *buf_wmesg = BUF_WMESG;
> @@ -160,7 +161,40 @@
>  struct bufhashhdr *
>  bufhash(struct vnode *vnp, daddr_t bn)
>  {
> -	return(&bufhashtbl[(((uintptr_t)(vnp) >> 7) + (int)bn) & bufhashmask]);
> +	u_int64_t hashkey64;
> +	int hashkey;
> +
> +	/*
> +	 * A variation on the Fibonacci hash that Knuth credits to
> +	 * R. W. Floyd, see Knuth's _Art of Computer Programming,
> +	 * Volume 3 / Sorting and Searching_
> +	 *
> +	 * We reduce the argument to 32 bits before doing the hash to
> +	 * avoid the need for a slow 64x64 multiply on 32 bit platforms.
> +	 *
> +	 * sizeof(struct vnode) is 168 on i386, so toss some of the lower
> +	 * bits of the vnode address to reduce the key range, which
> +	 * improves the distribution of keys across buckets.
> +	 *
> +	 * The file system cylinder group blocks are very heavily
> +	 * used. They are located at intervals of fpg, which is
> +	 * on the order of 89 to 94 * 2^10, depending on other
> +	 * filesystem parameters, for a 16k block size. Smaller block
> +	 * sizes will reduce fpg approximately proportionally. This
> +	 * will cause the cylinder group index to be hashed using the
> +	 * lower bits of the hash multiplier, which will not distribute
> +	 * the keys as uniformly in a classic Fibonacci hash where a
> +	 * relatively small number of the upper bits of the result
> +	 * are used. Using 2^16 as a close-enough approximation to
> +	 * fpg, split the hash multiplier in half, with the upper 16
> +	 * bits being the inverse of the golden ratio, and the lower
> +	 * 16 bits being a fraction between 1/3 and 3/7 (closer to
> +	 * 3/7 in this case), that gives good experimental results.
> +	 */
> +	hashkey64 = ((u_int64_t)(uintptr_t)vnp >> 3) + (u_int64_t)bn;
> +	hashkey = (((u_int32_t)(hashkey64 + (hashkey64 >> 32)) * 0x9E376DB1u) >>
> +	    bufhashshift) & bufhashmask;
> +	return(&bufhashtbl[hashkey]);
>  }
>  
>  /*
> @@ -319,8 +353,9 @@
>  bufhashinit(caddr_t vaddr)
>  {
>  	/* first, make a null hash table */
> +	bufhashshift = 29;
>  	for (bufhashmask = 8; bufhashmask < nbuf / 4; bufhashmask <<= 1)
> -		;
> +		bufhashshift--;
>  	bufhashtbl = (void *)vaddr;
>  	vaddr = vaddr + sizeof(*bufhashtbl) * bufhashmask;
>  	--bufhashmask;

Hi, sorry for the delay. Tested, then realized a stupid mistake,
started over, panic'd the box, and finally tested. The panic was
a bit unsettling. It happened when I tried to call my dump routine
from the debugger. Might have been a typo for all I know(!?) - I
was doing it quickly to keep the benchmark script from bailing due to
"Profiling timer expired" killing my timer. (That's another kettle
of wax, I guess.)

Other than that, the newer hashkey stuff does seem to improve
the shallowness of the table.
Below are some samples while doing
the 1.4gb untar commands on 98% and 99%+ full disk. Again,
running w/ 75% avgbfree, avgifree so as to hit the table harder.

Let me know if this helps or you need something further.

regards,
k.
--
Ken Marx, kmarx@vicor-nb.com
We need to optimize and focus hard to resolve the long pole in the tent.
	- http://www.bigshed.com/cgi-bin/speak.cgi

------- snip --------

at 98% full:
============
----------
Dec  2 16:03:44 oos0b /kernel: 0: avgdepth[0] cnt=602
Dec  2 16:03:44 oos0b /kernel: 1: avgdepth[1] cnt=61
Dec  2 16:03:44 oos0b /kernel: 2: avgdepth[2] cnt=187
----------
Dec  2 16:04:59 oos0b /kernel: 0: avgdepth[1] cnt=1008
Dec  2 16:04:59 oos0b /kernel: 1: avgdepth[2] cnt=15
Dec  2 16:04:59 oos0b /kernel: 2: avgdepth[3] cnt=1
----------
Dec  2 16:06:20 oos0b /kernel: 0: avgdepth[1] cnt=971
Dec  2 16:06:20 oos0b /kernel: 1: avgdepth[2] cnt=51
Dec  2 16:06:20 oos0b /kernel: 2: avgdepth[3] cnt=2
----------
Dec  2 16:06:41 oos0b /kernel: 0: avgdepth[2] cnt=132
Dec  2 16:06:41 oos0b /kernel: 1: avgdepth[1] cnt=885
Dec  2 16:06:41 oos0b /kernel: 2: avgdepth[3] cnt=7

(then panic'd on next call to dump routine from debugger)

-------------------------------------------------------------------------------
at 99% full:
============
----------
Dec  2 17:41:36 oos0b /kernel: 0: avgdepth[2] cnt=211
Dec  2 17:41:36 oos0b /kernel: 1: avgdepth[1] cnt=808
Dec  2 17:41:36 oos0b /kernel: 2: avgdepth[3] cnt=5
----------
Dec  2 17:42:12 oos0b /kernel: 0: avgdepth[2] cnt=68
Dec  2 17:42:13 oos0b /kernel: 1: avgdepth[1] cnt=954
Dec  2 17:42:13 oos0b /kernel: 2: avgdepth[3] cnt=2
----------
Dec  2 17:43:17 oos0b /kernel: 0: avgdepth[1] cnt=965
Dec  2 17:43:17 oos0b /kernel: 1: avgdepth[2] cnt=57
Dec  2 17:43:17 oos0b /kernel: 2: avgdepth[3] cnt=2
----------
Dec  2 17:43:28 oos0b /kernel: 0: avgdepth[2] cnt=124
Dec  2 17:43:28 oos0b /kernel: 1: avgdepth[1] cnt=897
Dec  2 17:43:28 oos0b /kernel: 2: avgdepth[3] cnt=3
----------
Dec  2 17:43:43 oos0b /kernel: 0: avgdepth[2] cnt=183
Dec  2 17:43:43 oos0b /kernel: 1: avgdepth[1] cnt=833
Dec  2 17:43:43 oos0b /kernel: 2: avgdepth[3] cnt=8
----------
Dec  2 17:44:04 oos0b /kernel: 0: avgdepth[2] cnt=201
Dec  2 17:44:04 oos0b /kernel: 1: avgdepth[1] cnt=818
Dec  2 17:44:04 oos0b /kernel: 2: avgdepth[3] cnt=5
----------
Dec  2 17:44:25 oos0b /kernel: 0: avgdepth[2] cnt=205
Dec  2 17:44:25 oos0b /kernel: 1: avgdepth[1] cnt=813
Dec  2 17:44:25 oos0b /kernel: 2: avgdepth[3] cnt=6
----------
Dec  2 17:45:39 oos0b /kernel: 0: avgdepth[2] cnt=216
Dec  2 17:45:39 oos0b /kernel: 1: avgdepth[1] cnt=802
Dec  2 17:45:39 oos0b /kernel: 2: avgdepth[3] cnt=6
----------
Dec  2 17:45:44 oos0b /kernel: 0: avgdepth[2] cnt=215
Dec  2 17:45:44 oos0b /kernel: 1: avgdepth[1] cnt=802
Dec  2 17:45:44 oos0b /kernel: 2: avgdepth[3] cnt=7
----------
Dec  2 17:46:15 oos0b /kernel: 0: avgdepth[2] cnt=212
Dec  2 17:46:15 oos0b /kernel: 1: avgdepth[1] cnt=804
Dec  2 17:46:15 oos0b /kernel: 2: avgdepth[3] cnt=8
----------
Dec  2 17:47:51 oos0b /kernel: 0: avgdepth[2] cnt=202
Dec  2 17:47:51 oos0b /kernel: 1: avgdepth[1] cnt=818
Dec  2 17:47:51 oos0b /kernel: 2: avgdepth[3] cnt=4
----------
Dec  2 17:48:05 oos0b /kernel: 0: avgdepth[2] cnt=210
Dec  2 17:48:05 oos0b /kernel: 1: avgdepth[1] cnt=809
Dec  2 17:48:05 oos0b /kernel: 2: avgdepth[3] cnt=5
----------
Dec  2 17:48:18 oos0b /kernel: 0: avgdepth[2] cnt=214
Dec  2 17:48:18 oos0b /kernel: 1: avgdepth[1] cnt=802
Dec  2 17:48:18 oos0b /kernel: 2: avgdepth[3] cnt=8
----------
Dec  2 17:48:33 oos0b /kernel: 0: avgdepth[2] cnt=215
Dec  2 17:48:33 oos0b /kernel: 1: avgdepth[1] cnt=801
Dec  2 17:48:33 oos0b /kernel: 2: avgdepth[3] cnt=8
----------
Dec  2 17:48:50 oos0b /kernel: 0: avgdepth[2] cnt=229
Dec  2 17:48:50 oos0b /kernel: 1: avgdepth[1] cnt=785
Dec  2 17:48:50 oos0b /kernel: 2: avgdepth[3] cnt=10
----------
Dec  2 17:48:57 oos0b /kernel: 0: avgdepth[2] cnt=231
Dec  2 17:48:57 oos0b /kernel: 1: avgdepth[1] cnt=781
Dec  2 17:48:57 oos0b /kernel: 2: avgdepth[3] cnt=12
----------
Dec  2 17:49:11 oos0b /kernel: 0: avgdepth[2] cnt=231
Dec  2 17:49:11 oos0b /kernel: 1: avgdepth[1] cnt=781
Dec  2 17:49:11 oos0b /kernel: 2: avgdepth[3] cnt=12

From owner-freebsd-fs@FreeBSD.ORG Wed Dec  3 13:51:44 2003
Message-Id: <200312032151.hB3LpReF032726@gw.catspoiler.org>
Date: Wed, 3 Dec 2003 13:51:27 -0800 (PST)
From: Don Lewis <truckman@FreeBSD.org>
To: kmarx@vicor.com
In-Reply-To: <3FCD4396.1040100@vicor.com>
cc: freebsd-fs@FreeBSD.org
cc: mckusick@beastie.mckusick.com
Subject: Re: 4.8 ffs_dirpref problem

On 2 Dec, Ken Marx wrote:
>
> Don Lewis wrote:
>> I went ahead and spun a new version of my patch with the new multiplier,
>> one other tweak to the formula, and updated comments.
>
> Hi, sorry for the delay. Tested, then realized a stupid mistake,
> started over, panic'd the box, and finally tested. The panic was
> a bit unsettling. It happened when I tried to call my dump routine
> from the debugger. Might have been a typo for all I know(!?) - I
> was doing it quickly to keep the benchmark script from bailing due to
> "Profiling timer expired" killing my timer. (That's another kettle
> of wax, I guess.)
>
> Other than that, the newer hashkey stuff does seem to improve
> the shallowness of the table. Below are some samples while doing
> the 1.4gb untar commands on 98% and 99%+ full disk. Again,
> running w/ 75% avgbfree, avgifree so as to hit the table harder.

It looks good enough to me.  I just committed it.