Date:      Wed, 11 Jun 2014 01:46:31 +0400
From:      "Alexander V. Chernikov" <melifaro@FreeBSD.org>
To:        Bryan Venteicher <bryanv@daemoninthecloset.org>, current@freebsd.org,  net@freebsd.org
Subject:   Re: dhclient sucks cpu usage...
Message-ID:  <53977CB7.9010904@FreeBSD.org>
In-Reply-To: <20140610185626.GK31367@funkthat.com>
References:  <20140610000246.GW31367@funkthat.com> <100488220.4292.1402369436876.JavaMail.root@daemoninthecloset.org> <5396CD41.2080300@FreeBSD.org> <20140610162443.GD31367@funkthat.com> <5397415B.5070409@FreeBSD.org> <20140610185626.GK31367@funkthat.com>

On 10.06.2014 22:56, John-Mark Gurney wrote:
> Alexander V. Chernikov wrote this message on Tue, Jun 10, 2014 at 21:33 +0400:
>> On 10.06.2014 20:24, John-Mark Gurney wrote:
>>> Alexander V. Chernikov wrote this message on Tue, Jun 10, 2014 at 13:17 +0400:
>>>> On 10.06.2014 07:03, Bryan Venteicher wrote:
>>>>> Hi,
>>>>>
>>>>> ----- Original Message -----
>>>>>> So, after finding out that nc has a stupidly small buffer size (2k
>>>>>> even though there is space for 16k), I was still not getting as good
>>>>>> performance using nc between machines, so I decided to generate some
>>>>>> flame graphs to try to identify issues...  (Thanks to whoever included
>>>>>> a full set of modules, including dtraceall, on the memstick!)
>>>>>>
>>>>>> So, the first one is:
>>>>>> https://www.funkthat.com/~jmg/em.stack.svg
>>>>>>
>>>>>> As I was browsing around, em_handle_que was consuming quite a bit of
>>>>>> CPU for only doing ~50MB/sec over gige...  Running top -SH shows me
>>>>>> that the taskqueue for em was consuming about 50% CPU...  Also pretty
>>>>>> high for only 50MB/sec...  Looking closer, you'll see that bpf_mtap is
>>>>>> consuming ~3.18% (under ether_nh_input)...  I know I'm not running
>>>>>> tcpdump or anything, but I think dhclient uses bpf to be able to
>>>>>> inject packets and listen in on them, so I kill off dhclient, and
>>>>>> instantly, the taskqueue thread for em drops down to 40% CPU...
>>>>>> (the transfer rate only marginally improves, if it does)
>>>>>>
>>>>>> I decide to run another flame graph w/o dhclient running:
>>>>>> https://www.funkthat.com/~jmg/em.stack.nodhclient.svg
>>>>>>
>>>>>> and now _rxeof drops from 17.22% to 11.94%, pretty significant...
>>>>>>
>>>>>> So, if you care about performance, don't run dhclient...
>>>>>>
>>>>> Yes, I've noticed the same issue. It can absolutely kill performance
>>>>> in a VM guest. It is much more pronounced on only some of my systems,
>>>>> and I hadn't tracked it down yet. I wonder if this is fallout from
>>>>> the callout work, or if there was some bpf change.
>>>>>
>>>>> I've been using the kludgey workaround patch below.
>>>> Hm, pretty interesting.
>>>> dhclient should set up a proper filter (and it looks like it does:
>>>> 13:10 [0] m@ptichko s netstat -B
>>>>   Pid  Netif   Flags      Recv      Drop     Match Sblen Hblen Command
>>>>  1224    em0 -ifs--l  41225922         0        11     0     0 dhclient
>>>> )
>>>> see the "Match" count.
>>>> And BPF itself adds the cost of a read rwlock acquisition (plus a
>>>> bpf_filter() call for each consumer on the interface).
>>>> It should not introduce significant performance penalties.
>>> Don't forget that it has to process the returning ACKs... So, you're
>> Well, it can still be captured with a proper filter like "ip && udp &&
>> (port 67 or port 68)".
>> We're running tcpdump at high packet rates (>1M pps) and it does not
>> affect processing _much_.
>> We should probably convert its rwlock to an rmlock and use per-CPU
>> counters for statistics, but that's a different story.
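
As a side note, a pcap(3) consumer would install such a filter roughly
like this. A minimal sketch only: the function name and timeout values
are made up, the filter string is the one discussed above:

#include <pcap/pcap.h>

static pcap_t *
open_dhcp_capture(const char *ifname)
{
	char errbuf[PCAP_ERRBUF_SIZE];
	struct bpf_program prog;
	pcap_t *p;

	/* snaplen 65535, no promiscuous mode, 1000 ms read timeout */
	p = pcap_open_live(ifname, 65535, 0, 1000, errbuf);
	if (p == NULL)
		return (NULL);
	/* Compile and install the DHCP-only filter. */
	if (pcap_compile(p, &prog, "ip && udp && (port 67 || port 68)",
	    1, PCAP_NETMASK_UNKNOWN) != 0) {
		pcap_close(p);
		return (NULL);
	}
	if (pcap_setfilter(p, &prog) != 0) {
		pcap_freecode(&prog);
		pcap_close(p);
		return (NULL);
	}
	pcap_freecode(&prog);
	return (p);
}
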
>>> looking at around 10k+ pps that you have to handle and pass through the
>>> filter...  That's a lot of packets to process...
>>>
>>> Just for a bit more "double check", instead of using the HD as a
>>> source, I used /dev/zero...   I ran a netstat -w 1 -I em0 when
>>> running the test, and I was getting ~50.7MiB/s w/ dhclient running and
>>> then I killed dhclient and it instantly jumped up to ~57.1MiB/s.. So I
>>> launched dhclient again, and it dropped back to ~50MiB/s...
>> dhclient uses different BPF sockets for reading and writing (and it
>> moves the write socket to a privileged child process via fork()).
>> The problem we're facing is that dhclient does not set
>> _any_ read filter on the write socket:
>> 21:27 [0] zfscurr0# netstat -B
>>   Pid  Netif   Flags      Recv      Drop     Match Sblen Hblen Command
>>  1529    em0 --fs--l     86774     86769     86784  4044  3180 dhclient
>> --------------------------------------- ^^^^^ --------------------------
>>  1526    em0 -ifs--l     86789         0         1     0     0 dhclient
>>
>> so all traffic is pushed down to it, introducing contention on the BPF
>> descriptor mutex.
>>
>> (That's why I've asked for netstat -B output.)
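
Until BPF itself is fixed, a write-only consumer can protect itself by
installing a reject-all read filter on its descriptor, so no packet is
ever buffered there. A minimal sketch of the idea (not dhclient's
actual code):

#include <sys/types.h>
#include <sys/ioctl.h>

#include <net/bpf.h>
#include <net/if.h>

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int
open_bpf_writeonly(const char *ifname)
{
	/* "Return 0 bytes" == match nothing: no packets get buffered. */
	struct bpf_insn reject_all = BPF_STMT(BPF_RET | BPF_K, 0);
	struct bpf_program prog = { 1, &reject_all };
	struct ifreq ifr;
	int fd;

	if ((fd = open("/dev/bpf", O_WRONLY)) == -1)
		return (-1);
	memset(&ifr, 0, sizeof(ifr));
	strlcpy(ifr.ifr_name, ifname, sizeof(ifr.ifr_name));
	if (ioctl(fd, BIOCSETIF, &ifr) == -1 ||
	    ioctl(fd, BIOCSETF, &prog) == -1) {
		close(fd);
		return (-1);
	}
	return (fd);
}
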
>>
>> Please try the attached patch to fix this. It is not the right way to
>> fix this; we'd better change BPF behavior so that write-only consumers
>> are not attached to the interface's reader list at all.
>> This has been partially implemented as the net.bpf.optimize_writers
>> hack, but it does not work for all direct BPF consumers (those not
>> using the pcap(3) API).
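
Background on why optimize_writers works for pcap(3) users: the first
BIOCSETF a pcap consumer issues is the one-instruction "accept snaplen
bytes of everything" program installed by pcap_open_live(), which is
what the bpf_check_upgrade() heuristic in the attached patch tries to
recognize. A sketch of what that program looks like (the function name
and snaplen value here are illustrative):

#include <sys/types.h>
#include <sys/ioctl.h>

#include <net/bpf.h>

/* fd is an already-open /dev/bpf descriptor bound via BIOCSETIF. */
static int
set_snaplen_program(int fd, u_int snaplen)
{
	/* Unconditionally accept up to 'snaplen' bytes of every packet. */
	struct bpf_insn total_insn = BPF_STMT(BPF_RET | BPF_K, snaplen);
	struct bpf_program total_prog = { 1, &total_insn };

	return (ioctl(fd, BIOCSETF, &total_prog));
}
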
> 
> Ok, looks like this patch helps the issue...
> 
> netstat -B; sleep 5; netstat -B:
>   Pid  Netif   Flags      Recv      Drop     Match Sblen Hblen Command
>   958    em0 --fs--l   3880000        14        35  3868  2236 dhclient
>   976    em0 -ifs--l   3880014         0         1     0     0 dhclient
>   Pid  Netif   Flags      Recv      Drop     Match Sblen Hblen Command
>   958    em0 --fs--l   4178525        14        35  3868  2236 dhclient
>   976    em0 -ifs--l   4178539         0         1     0     0 dhclient
> 
> and now the rate only drops from ~66MiB/s to ~63MiB/s when dhclient is
> running...  Still a significant drop (5%), but better than before...
Interesting.
Can you provide some traces (pmc or dtrace ones)?

I'm unsure if this will help, but it's worth trying:
please revert my previous patch, apply the attached kernel patch,
reboot, set net.bpf.optimize_writers to 1, and try again.
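
For the curious: besides the snaplen heuristic, the attached patch also
converts the per-interface rwlock to an rmlock(9), which makes read
acquisitions on the per-packet tap paths nearly free (roughly a per-CPU
operation) while making the rare write-side acquisitions more expensive.
The usage pattern, as a standalone sketch (names are illustrative, this
is not the patch itself):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rmlock.h>

static struct rmlock example_lock;

static void
example_init(void)
{

	rm_init(&example_lock, "example rmlock");
}

static void
example_per_packet_path(void)
{
	/*
	 * Per-acquisition state on the caller's stack; this is what
	 * the BPFIF_TRACKER macro in the patch declares.
	 */
	struct rm_priotracker tracker;

	rm_rlock(&example_lock, &tracker);
	/* hot path: walk the descriptor list, run filters */
	rm_runlock(&example_lock, &tracker);
}

static void
example_config_path(void)
{

	rm_wlock(&example_lock);	/* waits for readers on all CPUs */
	/* attach or detach a consumer */
	rm_wunlock(&example_lock);
}
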
[attachment: bpf_optimize.diff]

Index: sys/net/bpf.c
===================================================================
--- sys/net/bpf.c	(revision 266306)
+++ sys/net/bpf.c	(working copy)
@@ -44,7 +44,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/types.h>
 #include <sys/param.h>
 #include <sys/lock.h>
-#include <sys/rwlock.h>
+#include <sys/rmlock.h>
 #include <sys/systm.h>
 #include <sys/conf.h>
 #include <sys/fcntl.h>
@@ -643,6 +643,20 @@ bpf_attachd(struct bpf_d *d, struct bpf_if *bp)
 }
 
 /*
+ * Check if the filter looks like the 'set snaplen'
+ * program installed by pcap-bpf.c:pcap_activate_bpf().
+ */
+static void
+bpf_check_upgrade(u_long cmd, struct bpf_insn *fcode, u_int flen, int *is_snap)
+{
+
+	if (cmd == BIOCSETF && flen == 1 && fcode[0].code == (BPF_RET | BPF_K))
+		*is_snap = 1;
+	else
+		*is_snap = 0;
+}
+
+/*
  * Add d to the list of active bp filters.
  * Reuqires bpf_attachd() to be called before
  */
@@ -1728,7 +1742,7 @@ bpf_setf(struct bpf_d *d, struct bpf_program *fp,
 #endif
 	size_t size;
 	u_int flen;
-	int need_upgrade;
+	int is_snaplen, need_upgrade;
 
 #ifdef COMPAT_FREEBSD32
 	switch (cmd) {
@@ -1755,6 +1769,7 @@ bpf_setf(struct bpf_d *d, struct bpf_program *fp,
 #ifdef BPF_JITTER
 	jfunc = ofunc = NULL;
 #endif
+	is_snaplen = 0;
 	need_upgrade = 0;
 
 	/*
@@ -1773,6 +1788,8 @@ bpf_setf(struct bpf_d *d, struct bpf_program *fp,
 			free(fcode, M_BPF);
 			return (EINVAL);
 		}
+		/* Try to guess if this is snaplen cmd */
+		bpf_check_upgrade(cmd, fcode, flen, &is_snaplen);
 #ifdef BPF_JITTER
 		/* Filter is copied inside fcode and is perfectly valid. */
 		jfunc = bpf_jitter(fcode, flen);
@@ -1807,11 +1824,27 @@ bpf_setf(struct bpf_d *d, struct bpf_program *fp,
 			 * Do not require upgrade by first BIOCSETF
 			 * (used to set snaplen) by pcap_open_live().
 			 */
-			if (d->bd_writer != 0 && --d->bd_writer == 0)
-				need_upgrade = 1;
-			CTR4(KTR_NET, "%s: filter function set by pid %d, "
-			    "bd_writer counter %d, need_upgrade %d",
-			    __func__, d->bd_pid, d->bd_writer, need_upgrade);
+			if (d->bd_writer != 0) {
+				if (is_snaplen == 0) {
+					/*
+					 * We're probably using bpf directly.
+					 * Upgrade immediately.
+					 */
+					need_upgrade = 1;
+				} else if (--d->bd_writer == 0) {
+					/*
+					 * First snaplen filter has already
+					 * been set. This is probably a
+					 * catch-all filter.
+					 */
+					need_upgrade = 1;
+				}
+				CTR5(KTR_NET,
+				    "%s: filter function set by pid %d, "
+				    "bd_writer counter %d, snap %d upgrade %d",
+				    __func__, d->bd_pid, d->bd_writer,
+				    is_snaplen, need_upgrade);
+			}
 		}
 	}
 	BPFD_UNLOCK(d);
@@ -1842,6 +1875,7 @@ bpf_setif(struct bpf_d *d, struct ifreq *ifr)
 {
 	struct bpf_if *bp;
 	struct ifnet *theywant;
+	BPFIF_TRACKER;
 
 	BPF_LOCK_ASSERT();
 
@@ -2038,6 +2072,7 @@ bpf_tap(struct bpf_if *bp, u_char *pkt, u_int pktl
 #endif
 	u_int slen;
 	int gottime;
+	BPFIF_TRACKER;
 
 	gottime = BPF_TSTAMP_NONE;
 
@@ -2105,6 +2140,7 @@ bpf_mtap(struct bpf_if *bp, struct mbuf *m)
 #endif
 	u_int pktlen, slen;
 	int gottime;
+	BPFIF_TRACKER;
 
 	/* Skip outgoing duplicate packets. */
 	if ((m->m_flags & M_PROMISC) != 0 && m->m_pkthdr.rcvif == NULL) {
@@ -2158,6 +2194,7 @@ bpf_mtap2(struct bpf_if *bp, void *data, u_int dle
 	struct bpf_d *d;
 	u_int pktlen, slen;
 	int gottime;
+	BPFIF_TRACKER;
 
 	/* Skip outgoing duplicate packets. */
 	if ((m->m_flags & M_PROMISC) != 0 && m->m_pkthdr.rcvif == NULL) {
@@ -2477,7 +2514,7 @@ bpfattach2(struct ifnet *ifp, u_int dlt, u_int hdr
 	LIST_INIT(&bp->bif_wlist);
 	bp->bif_ifp = ifp;
 	bp->bif_dlt = dlt;
-	rw_init(&bp->bif_lock, "bpf interface lock");
+	rm_init(&bp->bif_lock, "bpf interface lock");
 	KASSERT(*driverp == NULL, ("bpfattach2: driverp already initialized"));
 	*driverp = bp;
 
@@ -2582,7 +2619,7 @@ bpf_ifdetach(void *arg __unused, struct ifnet *ifp
 
 		LIST_REMOVE(bp, bif_next);
 
-		rw_destroy(&bp->bif_lock);
+		rm_destroy(&bp->bif_lock);
 		free(bp, M_BPF);
 
 		nmatched++;
@@ -2696,6 +2733,7 @@ bpf_zero_counters(void)
 {
 	struct bpf_if *bp;
 	struct bpf_d *bd;
+	BPFIF_TRACKER;
 
 	BPF_LOCK();
 	LIST_FOREACH(bp, &bpf_iflist, bif_next) {
@@ -2760,6 +2798,7 @@ bpf_stats_sysctl(SYSCTL_HANDLER_ARGS)
 	int index, error;
 	struct bpf_if *bp;
 	struct bpf_d *bd;
+	BPFIF_TRACKER;
 
 	/*
 	 * XXX This is not technically correct. It is possible for non
Index: sys/net/bpf.h
===================================================================
--- sys/net/bpf.h	(revision 266306)
+++ sys/net/bpf.h	(working copy)
@@ -1260,7 +1260,7 @@ struct bpf_if {
 	u_int bif_dlt;				/* link layer type */
 	u_int bif_hdrlen;		/* length of link header */
 	struct ifnet *bif_ifp;		/* corresponding interface */
-	struct rwlock bif_lock;		/* interface lock */
+	struct rmlock bif_lock;		/* interface lock */
 	LIST_HEAD(, bpf_d)	bif_wlist;	/* writer-only list */
 	int flags;			/* Interface flags */
 #endif
Index: sys/net/bpfdesc.h
===================================================================
--- sys/net/bpfdesc.h	(revision 266306)
+++ sys/net/bpfdesc.h	(working copy)
@@ -152,10 +152,11 @@ struct xbpf_d {
 	u_int64_t	bd_spare[4];
 };
 
-#define BPFIF_RLOCK(bif)	rw_rlock(&(bif)->bif_lock)
-#define BPFIF_RUNLOCK(bif)	rw_runlock(&(bif)->bif_lock)
-#define BPFIF_WLOCK(bif)	rw_wlock(&(bif)->bif_lock)
-#define BPFIF_WUNLOCK(bif)	rw_wunlock(&(bif)->bif_lock)
+#define	BPFIF_TRACKER		struct rm_priotracker tracker
+#define BPFIF_RLOCK(bif)	rm_rlock(&(bif)->bif_lock, &tracker)
+#define BPFIF_RUNLOCK(bif)	rm_runlock(&(bif)->bif_lock, &tracker)
+#define BPFIF_WLOCK(bif)	rm_wlock(&(bif)->bif_lock)
+#define BPFIF_WUNLOCK(bif)	rm_wunlock(&(bif)->bif_lock)
 
 #define BPFIF_FLAG_DYING	1	/* Reject new bpf consumers */
 
Index: sys/kern/subr_witness.c
===================================================================
--- sys/kern/subr_witness.c	(revision 266306)
+++ sys/kern/subr_witness.c	(working copy)
@@ -550,7 +550,7 @@ static struct witness_order_list_entry order_lists
 	 * BPF
 	 */
 	{ "bpf global lock", &lock_class_mtx_sleep },
-	{ "bpf interface lock", &lock_class_rw },
+	{ "bpf interface lock", &lock_class_rm },
 	{ "bpf cdev lock", &lock_class_mtx_sleep },
 	{ NULL, NULL },
 	/*
