Date:      Tue, 13 Nov 2012 00:18:25 -0800
From:      Alfred Perlstein <bright@mu.org>
To:        Andre Oppermann <oppermann@networx.ch>
Cc:        "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>, Adrian Chadd <adrian@freebsd.org>, Peter Wemm <peter@wemm.org>
Subject:   Re: auto tuning tcp
Message-ID:  <50A20251.7010302@mu.org>
In-Reply-To: <50A1FF80.3040900@networx.ch>
References:  <50A0A0EF.3020109@mu.org> <50A0A502.1030306@networx.ch> <50A0B8DA.9090409@mu.org> <50A0C0F4.8010706@networx.ch> <EB2C22B5-C18D-4AC2-8694-C5C0D96C07B3@mu.org> <50A13961.1030909@networx.ch> <50A14460.9020504@mu.org> <50A1E2E7.3090705@mu.org> <50A1E47C.1030208@mu.org> <CAGE5yCoj1dL9w-EMMi8iYMTOq9uUUHmFg4rMY7aPneUBHBv67Q@mail.gmail.com> <50A1EC92.9000507@mu.org> <50A1FF80.3040900@networx.ch>

On 11/13/12 12:06 AM, Andre Oppermann wrote:
> On 13.11.2012 07:45, Alfred Perlstein wrote:
>> On 11/12/12 10:23 PM, Peter Wemm wrote:
>>> On Mon, Nov 12, 2012 at 10:11 PM, Alfred Perlstein <bright@mu.org> 
>>> wrote:
>>>> On 11/12/12 10:04 PM, Alfred Perlstein wrote:
>>>>> On 11/12/12 10:48 AM, Alfred Perlstein wrote:
>>>>>> On 11/12/12 10:01 AM, Andre Oppermann wrote:
>>>>>>>
>>>>>>> I've already added the tunable "kern.maxmbufmem" which is in pages.
>>>>>>> That's probably not very convenient to work with.  I can change it
>>>>>>> to a percentage of phymem/kva.  Would that make you happy?
>>>>>>>
>>>>>> It really makes sense to have the hash table be some relation to 
>>>>>> sockets
>>>>>> rather than buffers.
>>>>>>
>>>>>> If you are hashing "foo-objects" you want the hash to be some 
>>>>>> relation to
>>>>>> the max amount of "foo-objects" you'll see, not backwards derived 
>>>>>> from the
>>>>>> number of "bar-objects" that "foo-objects" contain, right?
>>>>>>
>>>>>> Because we are hashing the sockets, right?   not clusters.
>>>>>>
>>>>>> Maybe I'm wrong?  I'm open to ideas.
>>>>>
>>>>> Hey Andre, the following patch is what I was thinking
>>>>> (uncompiled/untested), it basically rounds up the maxsockets to a 
>>>>> power of 2
>>>>> and replaces the default 512 tcb hashsize.
>>>>>
>>>>> It might make sense to make the auto-tuning default to a minimum 
>>>>> of 512.
>>>>>
>>>>> There are a number of other hashes with static sizes that could 
>>>>> make use
>>>>> of this logic provided it's not upside-down.
>>>>>
>>>>> Any thoughts on this?
>>>>>
>>>>> Tune the tcp pcb hash based on maxsockets.
>>>>> Be more forgiving of poorly chosen tunables by finding a closer power
>>>>> of two rather than clamping down to 512.
>>>>> Index: tcp_subr.c
>>>>> ===================================================================
>>>>
>>>> Sorry, GUI mangled the patch... attaching a plain text version.
>>>>
>>>>
>>> Wait, you want to replace a hash with a flat array?  Why even bother
>>> to call it a hash at that point?
>>>
>>>
>>
>> If you are concerned about the space/time tradeoff I'm pretty happy 
>> with making it 1/2, 1/4th, 1/8th
>> the size of maxsockets.  (smaller?)
>>
>> Would that work better?
>
> I'd go for 1/8 or even 1/16 with a lower bound of 512.  More than
> that is excessive.

I'm OK with 1/8.  All I'm really going for is trying to make it somewhat 
better than 512 when un-tuned.
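
A rough sketch of the sizing rule discussed here, in plain userland C rather than kernel code. The helper names (roundup_pow2, tcb_hashsize_for) are made up for illustration and are not taken from the patch; the 1/8 divisor and the floor of 512 are the values proposed in this thread.

```c
/*
 * Hypothetical helper (not from the patch): smallest power of two >= n.
 * Returns 1 for n == 0.  Only meant for values well below the top bit.
 */
static unsigned long
roundup_pow2(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return (p);
}

/*
 * Hypothetical sizing rule: take maxsockets / 8, round up to a power
 * of two, and never go below 512 (the old static default).
 */
static unsigned long
tcb_hashsize_for(unsigned long maxsockets)
{
	unsigned long size = roundup_pow2(maxsockets / 8);

	return (size < 512 ? 512 : size);
}
```

With these assumptions, maxsockets = 250000 gives 250000 / 8 = 31250, which rounds up to 32768 buckets, while a small value such as maxsockets = 1024 stays at the 512 floor.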
>
>> The reason I chose to make it equal to max sockets was a space/time 
>> tradeoff, ideally a hash should
>> have zero collisions and if a user has enough memory for 250,000 
>> sockets, then surely they have
>> enough memory for 256,000 pointers.
>
> I agree in general.  Though not all large memory servers do serve a
> large amount of connections.  We have to find a tradeoff here.
>
> Having a perfect hash would certainly be laudable.  As long as the
> average hash chain doesn't go beyond a few entries it's not a problem.
>
>> If you strongly disagree then I am fine with a more conservative 
>> setting, just note that effectively
>> the hash table will require 1/2 the factor that we go smaller in 
>> additional traversals when we max
>> out the number of sockets.  Meaning if the table is 1/4 the size of 
>> max sockets, when we hit that
>> many tcp connections I think we'll see an order of average 2 linked 
>> list traversals to find a node.
>> At 1/8, then that number becomes 4.
>
> I'm fine with that and claim that if you expect N sockets that you
> would also increase maxfiles/sockets to N*2 to have some headroom.
That is a good point.
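
To put rough numbers on the chain lengths, here is a small sketch assuming sockets spread uniformly over the buckets of a chained hash. The function names are hypothetical, and the second formula is the textbook estimate for a successful lookup in a chained table, 1 + (n - 1) / (2m), not something measured on the tcb hash.

```c
/* Average chain length (load factor): n sockets over m buckets. */
static double
avg_chain_len(unsigned long nsockets, unsigned long nbuckets)
{
	return ((double)nsockets / (double)nbuckets);
}

/*
 * Expected number of nodes examined to find an existing entry,
 * assuming uniform hashing: 1 + (n - 1) / (2m).
 */
static double
avg_hit_probes(unsigned long nsockets, unsigned long nbuckets)
{
	if (nsockets == 0)
		return (0.0);
	return (1.0 + (double)(nsockets - 1) / (2.0 * (double)nbuckets));
}
```

For example, if 32768 sockets are expected and maxsockets is set to twice that for headroom, a table of maxsockets / 8 = 8192 buckets sees an average chain of 4 entries at the expected load, and a lookup that hits examines about 3 nodes on average under these assumptions. That is the same ballpark as the "half the factor" figure above.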
>
>> I recall back in 2001 on a PII400 with a custom webserver I wrote 
>> having a huge benefit by upping
>> this to 2^14 or maybe even 2^16, I forget, but suddenly my CPU went 
>> down a huge amount and I didn't
>> have to worry about a load balancer or other tricks.
>
> I can certainly believe that.  A hash size of 512 is no good if
> you have more than 4K connections.
>
> PS: Please note that my patch for mbuf and maxfiles tuning is not yet
> in HEAD, it's still sitting in my tcp_workqueue branch.  I still have
> to search for derived values that may get totally out of whack with
> the new scaling scheme.
>
This is cool!  Thank you for the feedback.

Would you like me to put this on a user branch somewhere for you to 
merge into your perf branch?

-Alfred


