Date:      Thu, 23 Oct 2003 16:53:39 -0700
From:      Ken Marx <kmarx@vicor.com>
To:        Kirk McKusick <mckusick@beastie.mckusick.com>
Cc:        Julian Elischer <julian@vicor.com>
Subject:   Re: 4.8 ffs_dirpref problem
Message-ID:  <3F986A03.2050809@vicor.com>
In-Reply-To: <200310231946.h9NJkQeN007683@beastie.mckusick.com>
References:  <200310231946.h9NJkQeN007683@beastie.mckusick.com>

Ok, thanks, Kirk. Newfs'ing and re-doing our test is on the
todo list; probably an overnight thing.

Meanwhile we did a bit more digging and may have found an anomaly:

We did a few escapes to ddb while the performance was bad to
see what a typical stack looked like:

--- interrupt, eip = 0xc01d9af4, esp = 0xcfe24bf8, ebp = 0xcfe24c04 ---
gbincore(cf3c6d00,1d090040,cfe24ca8,401,0) at gbincore+0x34
getblk(cf3c6d00,1d090040,1000,0,0) at getblk+0x80
bread(cf3c6d00,1d090040,1000,0,cfe24ca8) at bread+0x27
ffs_alloccg(c21eaf00,1d09,0,800) at ffs_alloccg+0x70
ffs_hashalloc(c21eaf00,1908,6420008,800,c026f110) at ffs_hashalloc+0x8c
ffs_alloc(c21eaf00,0,6420008,800,c1f93080) at ffs_alloc+0xc9
ffs_balloc(cfe24e2c,cfc9da40,c203bd80,20001,cfccfde0) at ffs_balloc+0x46a
ffs_write(cfe24e64,c203bd80,cf9934e0,41b,c03695a0) at ffs_write+0x319
vn_write(c203bd80,cfe24ed4,c1f93080,0,cf9934e0) at vn_write+0x15e
dofilewrite(cf9934e0,c203bd80,4,809d200,41b) at dofilewrite+0xc1
write(cf9934e0,cfe24f80,41b,809d200,0) at write+0x3b
---------------

So, the ffs_alloccg() logic needs to read the cg block. It goes
through getblk(), which in turn checks whether the block is already in
an in-memory hash table via the lookup routine, gbincore().
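
Roughly, the lookup amounts to the following (a from-memory sketch,
not the exact 4.8 vfs_bio.c source; the hash expression in particular
is only illustrative, and in the real tree the table and mask live in
vfs_bio.c):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/queue.h>
#include <sys/buf.h>
#include <sys/vnode.h>

/*
 * Sketch of a gbincore()-style lookup: hash (vp, blkno) to a bucket,
 * then walk that bucket's chain comparing both fields.  Each lookup
 * pays for its position in the chain, so one very deep bucket makes
 * otherwise-cheap hits expensive.
 */
LIST_HEAD(bhash_sketch, buf) *bufhashtbl;   /* declared in vfs_bio.c */
u_long bufhashmask;                         /* buckets - 1 */

struct buf *
gbincore_sketch(struct vnode *vp, daddr_t blkno)
{
    struct bhash_sketch *bh;
    struct buf *bp;

    bh = &bufhashtbl[(((uintptr_t)vp >> 7) + (int)blkno) & bufhashmask];
    LIST_FOREACH(bp, bh, b_hash) {
        if (bp->b_vp == vp && bp->b_lblkno == blkno)
            break;          /* found at this depth */
    }
    return (bp);            /* NULL on a miss */
}

So the cost of each lookup is proportional to how deep the buffer
sits in its bucket's chain.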

Julian had the thought that perhaps there was something funny
about this hash table, possibly with respect to cg blocks.

So, we hacked in a few routines to histogram how often each
bucket was searched, and the 'average depth' of each search of that bucket.
(This crude average is the running sum of depths found over all searches
of the bucket, divided by the total number of times the bucket was searched.)
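
The histogram hack is basically just a pair of per-bucket counters
bumped alongside the bucket walk; a minimal sketch (all names here are
ours, nothing from the tree, and the table size is assumed):

#include <sys/param.h>
#include <sys/queue.h>
#include <sys/buf.h>
#include <sys/vnode.h>

/*
 * Per-bucket counters for the histogram.  The avgdepth numbers
 * reported below are bh_depthsum[i] / bh_freq[i].
 */
#define BH_NBUCKETS     1024                /* assumed table size */

static u_long   bh_freq[BH_NBUCKETS];       /* times bucket i searched */
static u_long   bh_depthsum[BH_NBUCKETS];   /* running sum of depths */

LIST_HEAD(bh_bucket, buf);                  /* a bucket's chain */

/*
 * Called with the bucket a lookup hashed to; walks the chain the same
 * way the lookup does and tallies how deep the search had to go.
 */
static void
bh_record(int idx, struct bh_bucket *bh, struct vnode *vp, daddr_t blkno)
{
    struct buf *bp;
    u_long depth = 0;

    LIST_FOREACH(bp, bh, b_hash) {
        depth++;
        if (bp->b_vp == vp && bp->b_lblkno == blkno)
            break;
    }
    bh_freq[idx]++;
    bh_depthsum[idx] += depth;
}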

We found that block lookups spike sharply at bucket 250, and that
the average depth of that bucket is 10-100 times that of any other
of the 1023 buckets in the hash table:

 bh[247]: freq=1863, avgdepth = 1
 bh[248]: freq=1860, avgdepth = 1
 bh[249]: freq=1777, avgdepth = 1
 bh[250]: freq=969100, avgdepth = 440
 bh[251]: freq=1595, avgdepth = 12
 bh[252]: freq=1437, avgdepth = 1

To verify that these were cg block lookups, we did a
similar histogram of hash indexes for the actual
bread() calls in ffs_alloccg(). That is, the bucket
that would be hashed for

	(ip->i_devvp, fsbtodb(fs, cgtod(fs, cg)))
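
Concretely, the bucket index we tallied for each bread() call was
computed along these lines (a sketch; cgtod()/fsbtodb() are the
standard FFS macros, while the mask and hash expression are assumptions
standing in for whatever gbincore() really does):

#include <sys/param.h>
#include <sys/vnode.h>
#include <ufs/ufs/quota.h>
#include <ufs/ufs/inode.h>
#include <ufs/ffs/fs.h>

/*
 * Which bucket a cylinder group's block lands in: cgtod() gives the
 * cg block's filesystem (fragment) address, fsbtodb() converts that
 * to a device block number, and (i_devvp, blkno) go through the same
 * kind of hash gbincore() uses.
 */
#define BH_MASK 1023                        /* assumed: nbuckets - 1 */

static int
cgblk_bucket(struct fs *fs, struct inode *ip, int cg)
{
    daddr_t blkno = fsbtodb(fs, cgtod(fs, cg));

    return ((((uintptr_t)ip->i_devvp >> 7) + (int)blkno) & BH_MASK);
}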

We got similar, corroborating results:

 bh[248]: freq=0
 bh[249]: freq=0
 bh[250]: freq=662387
 bh[251]: freq=0
 bh[252]: freq=40
 bh[253]: freq=0

It appears that lookups for cg blocks (which are probably
in memory already) tend to be more costly than they should be.

So, a better-tuned file system would likely help.
But is it also possible that no tuning would be needed if
the hash table were more evenly distributed?

We can dump the block list for the anomalous hash table
bucket if you wish, or send along any other info you'd find
useful. Maybe we'll hack in a new hashing
function just for kicks to see what happens...
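
If we do, it will probably just be a multiplicative mix of the block
number before masking, on the theory that cg blocks are regularly
spaced on disk and a purely additive hash can alias a regular stride
into a handful of buckets if the stride lines up with the table size.
Something along these lines (a sketch, nothing more):

#include <sys/param.h>
#include <sys/vnode.h>

/*
 * Possible replacement hash to experiment with: spread the block
 * number's bits with a multiplicative hash (Knuth's 32-bit
 * golden-ratio constant) before masking, then fold in the vnode, so
 * block numbers that share a common stride don't all collapse into
 * one bucket.  Returns a bucket index for a table of mask + 1 slots.
 */
static __inline u_long
bufhash_alt_index(struct vnode *vp, daddr_t blkno, u_long mask)
{
    u_long h;

    h = (u_long)blkno * 2654435761UL;   /* Knuth multiplier */
    h ^= (uintptr_t)vp >> 7;            /* fold in the vnode */
    return (h & mask);
}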

Thanks again for your time!

k
	

Kirk McKusick wrote:
> 	Date: Thu, 23 Oct 2003 11:08:02 -0700
> 	From: Ken Marx <kmarx@vicor.com>
> 	To: Julian Elischer <julian@vicor.com>
> 	CC: mckusick@mckusick.com, cburrell@vicor.com, davep@vicor.com,
> 		freebsd-fs@freebsd.org, gluk@ptci.ru, jpl@vicor.com,
> 		jrh@vicor.com, julian@vicor-nb.com, VicPE@aol.com
> 	Subject: Re: 4.8 ffs_dirpref problem
> 	X-ASK-Info: Confirmed by User
> 
> 	Thanks for the reply,
> 
> 	We actually *did* try -s 4096 yesterday (not quite what you
> 	suggested) with spotty results: Sometimes it seemed to go
> 	more quickly, but often not.
> 
> 	Let me clarify our test: We have a 1.5gb tar file from our
> 	production raid that fairly represents the distribution of
> 	data. We hit the performance problem when we get to dirs
> 	with lots of small-ish files.  But, as Julian mentioned,
> 	we typically have many flavors of file sizes and populations.
> 
> 	Admittedly, our untar'ing test isn't necessarily representative
> 	of what happens in production - we were just trying to fill
> 	the disk and recreate the problem here. We *did* at least
> 	hit a noticeable problem, and we believe it's the same
> 	behavior that's hitting production.
> 
> 	I just tried your exact suggested settings on an fs that
> 	was already 96% full, and still experienced the very sluggish
> 	behavior on exactly the same type of files/dirs.
> 
> 	Our untar typically takes around 60-100 sec of system time
> 	when things are going ok; 300-1000+ sec when the sluggishness
> 	occurs.  This time tends to increase as we get closer to
> 	99%. Sometimes as high as 4000+ secs.
> 
> 	I wasn't clear from your mail if I should newfs the entire
> 	fs and start over, or if I could have expected the settings
> 	to make a difference for any NEW data.
> 
> 	I can do this latter if you think it's required. The test
> 	will then take several hours to run since we need at least
> 	85% disk usage to start seeing the problem.
> 
> 	Thanks!
> 	k
> 
> Unfortunately, I do believe that you will need to start over from
> scratch with a newfs. The problem is that by the time you are at
> 85% full with the old parameters, the directory structure is already
> too "dense" forcing you to search far and wide for more inodes. If
> you start from the beginning with a large filesperdir then your
> directory structure will expand across more of the disk which
> should approximate the old algorithm.
> 
> 	Kirk McKusick
> 
> 

-- 
Ken Marx, kmarx@vicor-nb.com
It's an orthogonal issue to leverage our critical resources and focus hard to 
resolve the market forces.
		- http://www.bigshed.com/cgi-bin/speak.cgi


