Date:      Thu, 23 May 2013 16:44:00 +0000
From:      "Bentkofsky, Michael" <MBentkofsky@verisign.com>
To:        Jeff Roberson <jroberson@jroberson.net>, John Baldwin <jhb@freebsd.org>
Cc:        "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>, "jeff@freebsd.org" <jeff@freebsd.org>, "rwatson@freebsd.org" <rwatson@freebsd.org>, "Charbon, Julien" <jcharbon@verisign.com>
Subject:   RE: Followup from Verisign after last week's developer summit
Message-ID:  <080FBD5B7A09F845842100A6DE79623321F703B5@BRN1WNEXMBX01.vcorp.ad.vrsn.com>
In-Reply-To: <alpine.BSF.2.00.1305211846470.2005@desktop>
References:  <080FBD5B7A09F845842100A6DE79623321F6E70C@BRN1WNEXMBX01.vcorp.ad.vrsn.com> <201305211320.26818.jhb@freebsd.org> <alpine.BSF.2.00.1305211204360.2005@desktop> <alpine.BSF.2.00.1305211846470.2005@desktop>

I am adding freebsd-net to this and will re-summarize to get additional
input. Thanks for all of the initial suggestions.

For the benefit of those on freebsd-net@: we are noticing significant
lock contention on the V_tcpinfo lock under moderately high connection
establishment and teardown rates (around 45-50k connections per second).
Our profiling suggests the contention on V_tcpinfo effectively
single-threads all TCP connection processing. Similar testing on Linux
with equivalent hardware does not show this contention and achieves a
much higher connection establishment rate. We can attach profiling and
test details if anyone would like.

JHB recommends:
- He has seen similar results in other kinds of testing.
- Linux uses RCU for the locking on the equivalent table (we've
confirmed this to be the case).
- Looking into a lock per bucket on the PCB lookup (see the sketch
below).
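
A rough userland sketch of the per-bucket idea, to show what we have in
mind (invented names; this is not the actual in_pcb code):

/*
 * "One lock per hash bucket": setup/teardown for flows that hash to
 * different buckets proceed in parallel instead of serializing on one
 * global lock.
 */
#include <pthread.h>
#include <stdint.h>

#define NBUCKETS 256                    /* power of two, like tcbhashsize */

struct pcb_sketch {
        struct pcb_sketch *next;
        uint32_t           hash;        /* cached flow hash */
        /* ... per-connection state ... */
};

static struct pcbbucket {
        pthread_mutex_t    lock;        /* protects this bucket only */
        struct pcb_sketch *head;
} buckets[NBUCKETS];

static void
buckets_init(void)
{
        for (int i = 0; i < NBUCKETS; i++)
                pthread_mutex_init(&buckets[i].lock, NULL);
}

static struct pcbbucket *
bucket_for(uint32_t hash)
{
        return (&buckets[hash & (NBUCKETS - 1)]);
}

/* Insert contends only with traffic hashing to the same bucket. */
static void
pcb_insert(struct pcb_sketch *p)
{
        struct pcbbucket *b = bucket_for(p->hash);

        pthread_mutex_lock(&b->lock);
        p->next = b->head;
        b->head = p;
        pthread_mutex_unlock(&b->lock);
}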

Jeff recommends:
- Changing the lock strategy so the hash lookup can be effectively
pushed further down into the stack.
- Making the [list] iterators more complex, like those now in use in
the hash lookup (see the sketch below).
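
And a rough userland model of the sentinel-style iterator (again with
invented names; the existing hash-lookup code is the reference):

/*
 * Sentinel-based list iteration: a dummy marker node holds the
 * iterator's place, so the list lock is only held while stepping;
 * setup/teardown of other entries can interleave with a long walk
 * instead of blocking behind it.
 */
#include <pthread.h>
#include <stddef.h>

struct lnode {
        struct lnode *next;
        int           is_marker;        /* markers are skipped by walkers */
};

struct slist {
        pthread_mutex_t lock;
        struct lnode    head;           /* dummy head, never a real entry */
};

/*
 * Unlink the marker, advance past any other markers to the next real
 * entry, and re-insert the marker after it.  A real version would also
 * take a reference on the returned entry before unlocking.
 */
static struct lnode *
iter_step(struct slist *l, struct lnode *marker)
{
        struct lnode *prev, *n;

        pthread_mutex_lock(&l->lock);
        for (prev = &l->head; prev->next != marker; prev = prev->next)
                ;                       /* find marker's predecessor */
        prev->next = marker->next;      /* unlink the marker */
        for (n = marker->next; n != NULL && n->is_marker; n = n->next)
                ;                       /* skip other iterators' markers */
        if (n != NULL) {                /* re-park marker after n */
                marker->next = n->next;
                n->next = marker;
        }
        pthread_mutex_unlock(&l->lock);
        return (n);
}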

We are starting down these paths to try to break the locking down.
We'll post some initial patch ideas soon. Meanwhile, any additional
suggestions are certainly welcome.

Finally, I will mention that we have enabled PCBGROUP in some of our
testing with 9.1 and found no change for our particular workload with
high connection establishment rates.

Thanks,
Mike

-----Original Message-----
From: Jeff Roberson [mailto:jroberson@jroberson.net]
Sent: Wednesday, May 22, 2013 12:50 AM
To: John Baldwin
Cc: Bentkofsky, Michael; rwatson@freebsd.org; jeff@freebsd.org; Charbon, Julien
Subject: Re: Followup from Verisign after last week's developer summit

On Tue, 21 May 2013, Jeff Roberson wrote:

> On Tue, 21 May 2013, John Baldwin wrote:
>
>> On Monday, May 20, 2013 9:48:02 am Bentkofsky, Michael wrote:
>>> Greetings gentlemen,
>>>
>>> It was a pleasure to meet you all last week at the FreeBSD developer
>>> summit. I would like to thank you for spending the time to discuss
>>> all the wonderful internals of the network stack. We also thoroughly
>>> enjoyed the discussion on receive side scaling.
>>>
>>> I'm sure you will remember both Julien Charbon and me asking
>>> questions regarding the TCP stack implementation, specifically
>>> around the locking internals. I am hoping to follow up with a path
>>> forward so we might be able to enhance the connection rate
>>> performance. Our internal testing has found that the V_tcpinfo lock
>>> prevents TCP scaling under high connection setup and teardown rates.
>>> In fact, we surmise that a new "FIN flood" attack may make it
>>> possible to degrade server connections significantly.
>>>
>>> In short, we are interested in changing this locking strategy and
>>> hope to get input from someone with more familiarity with the
>>> implementation. We're willing to be part of the coding effort and
>>> are willing to submit our suggestions to the community. I think we
>>> might just need some occasional input.
>>>
>>> Also, I will point out that our similar testing on Linux shows that
>>> performance between the two operating systems on the same multi-core
>>> hardware differs significantly. We're able to drive over 200,000
>>> connections per second on a Linux server compared to fewer than
>>> 50,000 on the FreeBSD server. We have kernel profiling details that
>>> we can share if you'd like.
>>
>> I have seen similar results with a redis cluster at work (we ended up
>> deploying proxies to allow applications to reuse existing connections
>> to avoid this).  I believe Linux uses RCU for this table.  You could
>> perhaps use an rm lock instead of an rw lock.  One idea I considered
>> was to split the pcbhash lock up further so you had one lock per hash
>> bucket, so that you could allow concurrent connection setup/teardown
>> so long as they were referencing different buckets.  However, I did
>> not think this would have been useful for the case at work since
>> those connections were insane (single packet request followed by
>> single packet reply with all the setup/teardown overhead) and all
>> going to the same listening socket (so all the setups would hash to
>> the same bucket).  Handling concurrent setup on the same listen
>> socket is a PITA but is in fact the common case.
>
> I don't think it's simply a synchronization primitive problem.  It
> looks to me like the fundamental issue is that the lock order for the
> tables is prior to the inp lock, which means we have to grab it very
> early.  Presumably this is the classic sort of container ->
> datastructure, datastructure -> container lock order problem.  This
> seems to be made more complex by protecting the list of all pcbs, the
> port allocation, and parts of the hash by the same lock.
>
> Have we tried to further decompose this lock?  I would experiment with
> that as a first step.  Is this grabbed in so many places just due to
> the complex lock order issue?  That seems to be the case.  There are
> only a handful of fields marked as protected by the inp info lock.  Do
> we know that this list is complete?
>
> My second step would be to attempt to turn the locking on its head.
> Change the lock order from inp lock to inp info lock.  You can resolve
> the lookup problem by adding an atomic reference count that holds the
> datastructure while you drop the hash lock and before you acquire the
> inp lock.  Then you could re-validate the inp after lookup.  I suspect
> it's not that simple and there are higher level races that you'll
> discover are being serialized by this big lock, but that's just a hunch.
>
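
Roughly, that lookup/ref/revalidate dance looks like the following (a
userland sketch with invented names, not the in_pcb code;
hash_find_locked() is an assumed helper):

/*
 * The reference taken under the hash lock keeps the object alive
 * across the window where no lock is held, and the "dead" flag
 * re-validates it once the object lock is acquired.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct inp_sketch {
        pthread_mutex_t lock;           /* per-connection lock */
        atomic_int      refcount;
        bool            dead;           /* set by teardown */
};

static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;

/* Assumed helper: find an entry; caller holds hash_lock. */
extern struct inp_sketch *hash_find_locked(uint32_t key);

static struct inp_sketch *
inp_lookup(uint32_t key)
{
        struct inp_sketch *inp;

        pthread_mutex_lock(&hash_lock);
        inp = hash_find_locked(key);
        if (inp != NULL)
                atomic_fetch_add(&inp->refcount, 1);    /* pin the object */
        pthread_mutex_unlock(&hash_lock);       /* no lock-order conflict now */
        if (inp == NULL)
                return (NULL);

        pthread_mutex_lock(&inp->lock);
        if (inp->dead) {                        /* re-validate after the gap */
                pthread_mutex_unlock(&inp->lock);
                atomic_fetch_sub(&inp->refcount, 1);    /* real code frees on last ref */
                return (NULL);
        }
        return (inp);                   /* returned locked and referenced */
}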

I read some more.  We have already done this lookup/ref/etc. dance for
the hash lock.  It handles the hard cases of multiple inp_* calls and
synchronizing the ports, bind, connect, etc.  It looks like the list
locks have been optimized to make the iterators simple.  I think this
is backwards now.  We should make the iterators complex and the normal
setup/teardown path simple.  The iterators can follow a model like the
hash lock, using sentinels to hold their place.  We have the same
pattern elsewhere.  It would allow you to acquire the INP_INFO lock
after the INP lock and push it much deeper into the stack.

Jeff


> What do you think, Robert?  If it would make improving the tcb locking
> simpler, it may fall under the umbrella of what Isilon needs, but I'm
> not sure that's the case.  Certainly my earlier attempts at deferred
> processing were made more complex by this arrangement.
>
> Thanks,
> Jeff
>
>>
>> The best forum for discussing this is probably on net@ as there are
>> likely other interested parties who might have additional ideas.
>> Also, it might be interesting to look at how connection groups try to
>> handle this.  I believe they use an alternate method of decomposing
>> the global lock into smaller chunks, and I think they might do
>> something to help mitigate the listen socket problem (perhaps they
>> duplicate listen sockets in all groups)?  Robert would be able to
>> chime in on that, but I believe he is not really back home until next
>> week.
>>
>> --
>> John Baldwin
>>
>


