Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 23 May 2014 14:06:55 +0200
From:      Julien Charbon <jcharbon@verisign.com>
To:        freebsd-net@freebsd.org
Cc:        George Neville-Neil <gnn@FreeBSD.org>
Subject:   Re: TCP stack lock contention with short-lived connections
Message-ID:  <537F39DF.1090900@verisign.com>
In-Reply-To: <len481$sfv$2@ger.gmane.org>
References:  <op.w51mxed6ak5tgc@fri2jcharbon-m1.local> <op.w56mamc0ak5tgc@dul1rjacobso-l3.vcorp.ad.vrsn.com> <len481$sfv$2@ger.gmane.org>

next in thread | previous in thread | raw e-mail | index | archive | help
This is a multi-part message in MIME format.
--------------030706050106030100090504
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit


  Hi,

On 27/02/14 11:32, Julien Charbon wrote:
> On 07/11/13 14:55, Julien Charbon wrote:
>> On Mon, 04 Nov 2013 22:21:04 +0100, Julien Charbon
>> <jcharbon@verisign.com> wrote:
>>> I have put technical and how-to-repeat details in below PR:
>>>
>>> kern/183659: TCP stack lock contention with short-lived connections
>>> http://www.freebsd.org/cgi/query-pr.cgi?pr=183659
>>>
>>>   We are currently working on this performance improvement effort;  it
>>> will impact only the TCP locking strategy not the TCP stack logic
>>> itself.  We will share on freebsd-net the patches we made for
>>> reviewing and improvement propositions;  anyway this change might also
>>> require enough eyeballs to avoid tricky race conditions introduction
>>> in TCP stack.
>
> [...]
>
>   2. We studied an another lock contention related to INP_INFO when TCP
> connections in TIME_WAIT state are cleaned-up. [...]

   First a follow-up on TCP performance improvements and BSDCan 2014 
discussions:

  - First, changes related to TCP lock connection and TIME_WAIT list 
have been reviewed, fixed, improved and accepted in HEAD:

http://svnweb.freebsd.org/base?view=revision&revision=264321
http://svnweb.freebsd.org/base?view=revision&revision=264342
http://svnweb.freebsd.org/base?view=revision&revision=264351
http://svnweb.freebsd.org/base?view=revision&revision=264356

  Thanks to testers, reviewers and committers (jhb, rwatson, rrs, 
adrian, glebius, gnn, bde, jmg, rrs, etc.).

  - Second, it has been discussed in BSDCan that it would nice to also 
share Verisign patches in public github.  This is done here:

  https://github.com/verisign/freebsd

  We currently maintain two branches:

  - One that follows 10.0-RELENG branch:

https://github.com/verisign/freebsd/commits/share/10.0-next

  - One that follows HEAD:

https://github.com/verisign/freebsd/commits/share/head-next

  These branches are cloned from on the handy github FreeBSD repository 
mirror:

https://github.com/freebsd/freebsd

  We pushed our stable enough TCP related patches in this repo.

  - Third, joined a patch (tcp-scale-syn-v1.patch) we would propose for 
review.  Obviously, it is also available here:

https://github.com/verisign/freebsd/commit/38679f5f66b470d069d635bfc8d9ec6c39eb787b

  This patch is based (like most of our TCP changes) on Robert Watson 
major change to decompose inpcbinfo lock (aka ipi_lock or INP_INFO:)

Decompose the current single inpcbinfo lock into two locks:
http://svnweb.freebsd.org/base?view=revision&revision=222488
https://github.com/freebsd/freebsd/commit/fdfdadb612a4b077d62094c7d4aa65de3524cf62

  In particular on this comment:

"
- The TCP syncache still relies on the pcbinfo lock, something that we
   may want to revisit.
"

  Basically this change prevents to take the TCP ipi_lock lock when a 
SYN targeting a listening socket is received.  We expected a need to add 
another syncache mutex to take over ipi_lock role, but surprisingly we 
don't find any requirements for a new mutex (and it doesn't mean that 
this requirement doesn't exist).

  This patch is quite stable under our workload, improves only a bit the 
short-lived TCP connection rate (~+2%), however it can become useful 
under SYN flood attack pressure.

  We still have to test more thoroughly this patch in two contexts:

  - VNET and,
  - syncache.hashsize sets very low relative to TCP workload

  As usual, tests, reviews and comments more than welcome.

  Thanks.

--
Julien

--------------030706050106030100090504
Content-Type: text/plain; charset=UTF-8;
 name="tcp-scale-syn-v1.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="tcp-scale-syn-v1.patch"

diff --git a/sys/netinet/tcp_input.c b/sys/netinet/tcp_input.c
index f9a751c..4b6f41f 100644
--- a/sys/netinet/tcp_input.c
+++ b/sys/netinet/tcp_input.c
@@ -588,6 +588,7 @@ tcp_input(struct mbuf *m, int off0)
 #endif /* INET6 */
 	struct tcpopt to;		/* options in this segment */
 	char *s = NULL;			/* address and port logging */
+	int needlock;
 	int ti_locked;
 #define	TI_UNLOCKED	1
 #define	TI_WLOCKED	2
@@ -770,12 +771,12 @@ tcp_input(struct mbuf *m, int off0)
 
 	/*
 	 * Locate pcb for segment; if we're likely to add or remove a
-	 * connection then first acquire pcbinfo lock.  There are two cases
+	 * connection then first acquire pcbinfo lock.  There are three cases
 	 * where we might discover later we need a write lock despite the
 	 * flags: ACKs moving a connection out of the syncache, and ACKs for
-	 * a connection in TIMEWAIT.
+	 * a connection in TIMEWAIT, SYNs for a non-listening socket.
 	 */
-	if ((thflags & (TH_SYN | TH_FIN | TH_RST)) != 0) {
+	if ((thflags & (TH_FIN | TH_RST)) != 0) {
 		INP_INFO_WLOCK(&V_tcbinfo);
 		ti_locked = TI_WLOCKED;
 	} else
@@ -1004,10 +1005,18 @@ tcp_input(struct mbuf *m, int off0)
 	 * now be in TIMEWAIT.
 	 */
 #ifdef INVARIANTS
-	if ((thflags & (TH_SYN | TH_FIN | TH_RST)) != 0)
+	if ((thflags & (TH_FIN | TH_RST)) != 0)
 		INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
 #endif
-	if (tp->t_state != TCPS_ESTABLISHED) {
+	needlock = 1;
+	if (tp->t_state == TCPS_ESTABLISHED) {
+		if ((thflags & TH_SYN) == 0)
+			needlock = 0;
+	} else {
+		if (tp->t_state == TCPS_LISTEN && (thflags & TH_SYN))
+			needlock = 0;
+	}
+	if (needlock) {
 		if (ti_locked == TI_UNLOCKED) {
 			if (INP_INFO_TRY_WLOCK(&V_tcbinfo) == 0) {
 				in_pcbref(inp);
@@ -1048,17 +1057,13 @@ tcp_input(struct mbuf *m, int off0)
 	/*
 	 * When the socket is accepting connections (the INPCB is in LISTEN
 	 * state) we look into the SYN cache if this is a new connection
-	 * attempt or the completion of a previous one.  Because listen
-	 * sockets are never in TCPS_ESTABLISHED, the V_tcbinfo lock will be
-	 * held in this case.
+	 * attempt or the completion of a previous one.
 	 */
 	if (so->so_options & SO_ACCEPTCONN) {
 		struct in_conninfo inc;
 
 		KASSERT(tp->t_state == TCPS_LISTEN, ("%s: so accepting but "
 		    "tp not listening", __func__));
-		INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
-
 		bzero(&inc, sizeof(inc));
 #ifdef INET6
 		if (isipv6) {
@@ -1081,6 +1086,8 @@ tcp_input(struct mbuf *m, int off0)
 		 * socket appended to the listen queue in SYN_RECEIVED state.
 		 */
 		if ((thflags & (TH_RST|TH_ACK|TH_SYN)) == TH_ACK) {
+
+			INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
 			/*
 			 * Parse the TCP options here because
 			 * syncookies need access to the reflected
@@ -1361,8 +1368,12 @@ tcp_input(struct mbuf *m, int off0)
 		syncache_add(&inc, &to, th, inp, &so, m, NULL, NULL);
 		/*
 		 * Entry added to syncache and mbuf consumed.
-		 * Everything already unlocked by syncache_add().
+		 * Nothing is unlocked by syncache_add().
 		 */
+		if (ti_locked == TI_WLOCKED) {
+			INP_INFO_WUNLOCK(&V_tcbinfo);
+			ti_locked = TI_UNLOCKED;
+		}
 		INP_INFO_UNLOCK_ASSERT(&V_tcbinfo);
 		return;
 	} else if (tp->t_state == TCPS_LISTEN) {
diff --git a/sys/netinet/tcp_syncache.c b/sys/netinet/tcp_syncache.c
index beb6749..9b981d3 100644
--- a/sys/netinet/tcp_syncache.c
+++ b/sys/netinet/tcp_syncache.c
@@ -1119,7 +1119,6 @@ syncache_add(struct in_conninfo *inc, struct tcpopt *to, struct tcphdr *th,
 	struct syncache scs;
 	struct ucred *cred;
 
-	INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
 	INP_WLOCK_ASSERT(inp);			/* listen socket */
 	KASSERT((th->th_flags & (TH_RST|TH_ACK|TH_SYN)) == TH_SYN,
 	    ("%s: unexpected tcp flags", __func__));
@@ -1150,13 +1149,11 @@ syncache_add(struct in_conninfo *inc, struct tcpopt *to, struct tcphdr *th,
 #ifdef MAC
 	if (mac_syncache_init(&maclabel) != 0) {
 		INP_WUNLOCK(inp);
-		INP_INFO_WUNLOCK(&V_tcbinfo);
 		goto done;
 	} else
 		mac_syncache_create(maclabel, inp);
 #endif
 	INP_WUNLOCK(inp);
-	INP_INFO_WUNLOCK(&V_tcbinfo);
 
 	/*
 	 * Remember the IP options, if any.

--------------030706050106030100090504--




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?537F39DF.1090900>