From owner-freebsd-net@FreeBSD.ORG Fri Mar 11 15:44:53 2005
Date: Fri, 11 Mar 2005 07:44:48 -0800
From: Luigi Rizzo
To: Gleb Smirnoff
Cc: dima <_pppp@mail.ru>, pjd@freebsd.org, John Baldwin, rwatson@freebsd.org, net@freebsd.org
Subject: Re: Giant-free polling [PATCH]
Message-ID: <20050311074448.A70328@xorpc.icir.org>
References: <200503011642.48813.jhb@FreeBSD.org> <20050301162949.A64187@xorpc.icir.org> <20050311110234.GA87255@cell.sick.ru>
In-Reply-To: <20050311110234.GA87255@cell.sick.ru>; from glebius@freebsd.org on Fri, Mar 11, 2005 at 02:02:34PM +0300

While I am happy to see interest in having this supported, and while I
understand the excitement of firing up the editor and writing code, I don't
think this approach -- hacking up some patch that lets multiple poll*
instances run without panicking the box, or discussing the performance of
lists vs. arrays vs. IFF_* flags -- will lead to anything.

Since I have asked this already, consider this msg a rant, for archival
purposes, calling for a sane approach to the problem.

Polling in the UP case is beneficial because of the features listed e.g. in

  http://docs.freebsd.org/cgi/getmsg.cgi?fetch=128950+0+/usr/local/www/db/text/2005/freebsd-net/20050306.freebsd-net

For the SMP case there are the basic issues to handle (proper locking,
preventing deadlocks), but also performance issues, e.g. avoiding that
multiple polling instances just fight for the lock while doing nothing
useful; see e.g. the case described in

  http://docs.freebsd.org/cgi/getmsg.cgi?fetch=71980+0+/usr/local/www/db/text/2005/freebsd-net/20050306.freebsd-net

Now, if you want to approach the issue seriously, you need to:

1) _design_ a locking strategy for SMP polling -- design meaning: provide a
   description in English, even a sketchy one, which will eventually end up
   in a manpage such as polling(4), or in the source file as a comment.
   A design is certainly not C code, or patches without comments from which
   one is supposed to figure out the overall picture.

2) evaluate how the design compares, in terms of features (e.g. those listed
   in the first reference above), to the UP case. I am not saying one should
   implement all of them, but at least it must be clear how the two things
   differ, if they do.
3) evaluate how the design addresses (in terms of performance) the
   contention problems, of which the second reference above is probably just
   one example, but at least a very obvious one;

4) implement it -- and if you want it to be reviewed, please post the full
   code, not just the differences -- the code will be far too different for
   a diff to be human-readable;

5) optimise it -- e.g. deal with arrays vs. lists, implicit vs. explicit
   flags, etc.

It seems to me that this whole discussion is proceeding from step 4 while
ignoring the first 3 steps. Please, step back from your keyboards and move
to the whiteboard, if nothing else to prove to me that you have already
addressed all the issues.

	cheers
	luigi

On Fri, Mar 11, 2005 at 02:02:34PM +0300, Gleb Smirnoff wrote:
> Colleagues,
>
> On Tue, Mar 01, 2005 at 04:29:49PM -0800, Luigi Rizzo wrote:
> L> [cc-ing net@freebsd.org to get more opinions]
>
> It is good that you have CC'ed net@, otherwise we would continue
> working in parallel without collaboration.
>
> Attached is a patch made in collaboration by ru, pjd and me.
>
> We use a separate mutex for polling, and we drop it before calling the
> poll handler to avoid LORs and contention on this mutex.
>
> We have definite evidence that idle_poll and ISR poll can now run
> in parallel: if we remove the PRF_RUNNING flag check, we get panics,
> because the poll handlers of many drivers (e.g. if_em) are not reentrant.
>
> I think this patch is a step forward in the direction of parallel polling
> on SMP, since we now have two polling instances (ISR + idle) working
> in parallel.
>
> --
> Totus tuus, Glebius.
> GLEBIUS-RIPN GLEB-RIPE
>
> Index: kern_poll.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/kern/kern_poll.c,v
> retrieving revision 1.19
> diff -u -r1.19 kern_poll.c
> --- kern_poll.c    25 Feb 2005 22:07:50 -0000    1.19
> +++ kern_poll.c    11 Mar 2005 10:49:47 -0000
> @@ -33,6 +33,7 @@
>  #include 
>  #include         /* needed by net/if.h */
>  #include 
> +#include 
>  
>  #include         /* for IFF_* flags */
>  #include         /* for NETISR_POLL */
> @@ -144,7 +145,7 @@
>  SYSCTL_INT(_kern_polling, OID_AUTO, residual_burst, CTLFLAG_RW,
>      &residual_burst, 0, "# of residual cycles in burst");
>  
> -static u_int32_t poll_handlers;    /* next free entry in pr[]. */
> +static u_int32_t poll_handlers;
>  SYSCTL_UINT(_kern_polling, OID_AUTO, handlers, CTLFLAG_RD,
>      &poll_handlers, 0, "Number of registered poll handlers");
>  
> @@ -169,20 +170,33 @@
>      &idlepoll_sleeping, 0, "idlepoll is sleeping");
>  
>  
> -#define POLL_LIST_LEN  128
>  struct pollrec {
>          poll_handler_t  *handler;
>          struct ifnet    *ifp;
> +#define PRF_RUNNING     0x1
> +#define PRF_PINNED      0x2
> +        uint32_t        flags;
> +        LIST_ENTRY(pollrec) next;
>  };
>  
> -static struct pollrec pr[POLL_LIST_LEN];
> +#define PR_VALID(pr)    ((pr)->handler != NULL && \
> +    !((pr)->flags & PRF_RUNNING) && \
> +    ((pr)->ifp->if_flags & (IFF_UP|IFF_RUNNING)) == \
> +    (IFF_UP|IFF_RUNNING))
> +
> +static LIST_HEAD(, pollrec) poll_list = LIST_HEAD_INITIALIZER(&poll_list);
> +
> +static struct mtx poll_mtx;
>  
>  static void
>  init_device_poll(void)
>  {
>  
> -        netisr_register(NETISR_POLL, (netisr_t *)netisr_poll, NULL, 0);
> -        netisr_register(NETISR_POLLMORE, (netisr_t *)netisr_pollmore, NULL, 0);
> +        mtx_init(&poll_mtx, "polling", NULL, MTX_DEF);
> +        netisr_register(NETISR_POLL, (netisr_t *)netisr_poll, NULL,
> +            NETISR_MPSAFE);
> +        netisr_register(NETISR_POLLMORE, (netisr_t *)netisr_pollmore, NULL,
> +            NETISR_MPSAFE);
>  }
>  SYSINIT(device_poll, SI_SUB_CLOCKS, SI_ORDER_MIDDLE, init_device_poll, NULL)
>  
> @@ -244,17 +258,26 @@
>  void
>  ether_poll(int count)
>  {
> -        int i;
> -
> -        mtx_lock(&Giant);
> +        struct pollrec *pr, *pr1;
>  
>          if (count > poll_each_burst)
>                  count = poll_each_burst;
> -        for (i = 0 ; i < poll_handlers ; i++)
> -                if (pr[i].handler && (IFF_UP|IFF_RUNNING) ==
> -                    (pr[i].ifp->if_flags & (IFF_UP|IFF_RUNNING)) )
> -                        pr[i].handler(pr[i].ifp, 0, count); /* quick check */
> -        mtx_unlock(&Giant);
> +
> +        mtx_lock(&poll_mtx);
> +        LIST_FOREACH_SAFE(pr, &poll_list, next, pr1) {
> +                if (PR_VALID(pr)) {
> +                        pr->flags |= PRF_RUNNING;
> +                        if (pr1 != NULL)
> +                                pr1->flags |= PRF_PINNED;
> +                        mtx_unlock(&poll_mtx);
> +                        pr->handler(pr->ifp, 0, count);
> +                        mtx_lock(&poll_mtx);
> +                        if (pr1 != NULL)
> +                                pr1->flags &= ~PRF_PINNED;
> +                        pr->flags &= ~PRF_RUNNING;
> +                }
> +        }
> +        mtx_unlock(&poll_mtx);
>  }
>  
>  /*
> @@ -320,17 +343,17 @@
>  
>  /*
>   * netisr_poll is scheduled by schednetisr when appropriate, typically once
> - * per tick. It is called at splnet() so first thing to do is to upgrade to
> - * splimp(), and call all registered handlers.
> + * per tick.
>   */
>  static void
>  netisr_poll(void)
>  {
> +        struct pollrec *pr, *pr1;
>          static int reg_frac_count;
> -        int i, cycles;
> +        int cycles;
>          enum poll_cmd arg = POLL_ONLY;
> -        mtx_lock(&Giant);
>  
> +        mtx_lock(&poll_mtx);
>          phase = 3;
>          if (residual_burst == 0) { /* first call in this tick */
>                  microuptime(&poll_start_t);
> @@ -372,25 +395,36 @@
>          residual_burst -= cycles;
>  
>          if (polling) {
> -                for (i = 0 ; i < poll_handlers ; i++)
> -                        if (pr[i].handler && (IFF_UP|IFF_RUNNING) ==
> -                            (pr[i].ifp->if_flags & (IFF_UP|IFF_RUNNING)) )
> -                                pr[i].handler(pr[i].ifp, arg, cycles);
> +                LIST_FOREACH_SAFE(pr, &poll_list, next, pr1) {
> +                        if (PR_VALID(pr)) {
> +                                pr->flags |= PRF_RUNNING;
> +                                mtx_unlock(&poll_mtx);
> +                                pr->handler(pr->ifp, arg, cycles);
> +                                mtx_lock(&poll_mtx);
> +                                pr->flags &= ~PRF_RUNNING;
> +                        } else if (pr->handler == NULL &&
> +                            !(pr->flags & PRF_PINNED)) {
> +                                LIST_REMOVE(pr, next);
> +                                free(pr, M_TEMP);
> +                        }
> +                }
>          } else { /* unregister */
> -                for (i = 0 ; i < poll_handlers ; i++) {
> -                        if (pr[i].handler &&
> -                            pr[i].ifp->if_flags & IFF_RUNNING) {
> -                                pr[i].ifp->if_flags &= ~IFF_POLLING;
> -                                pr[i].handler(pr[i].ifp, POLL_DEREGISTER, 1);
> +                LIST_FOREACH_SAFE(pr, &poll_list, next, pr1) {
> +                        if (pr->handler != NULL &&
> +                            pr->ifp->if_flags & IFF_RUNNING) {
> +                                pr->ifp->if_flags &= ~IFF_POLLING;
> +                                mtx_unlock(&poll_mtx);
> +                                pr->handler(pr->ifp, POLL_DEREGISTER, 1);
> +                                mtx_lock(&poll_mtx);
>                          }
> -                        pr[i].handler=NULL;
> +                        pr->handler = NULL;
>                  }
>                  residual_burst = 0;
>                  poll_handlers = 0;
>          }
>          /* on -stable, schednetisr(NETISR_POLLMORE); */
>          phase = 4;
> -        mtx_unlock(&Giant);
> +        mtx_unlock(&poll_mtx);
>  }
>  
>  /*
> @@ -399,12 +433,12 @@
>   * A device is not supposed to register itself multiple times.
>   *
>   * This is called from within the *_intr() functions, so we do not need
> - * further locking.
> + * further ifnet locking.
>   */
>  int
>  ether_poll_register(poll_handler_t *h, struct ifnet *ifp)
>  {
> -        int s;
> +        struct pollrec *pr;
>  
>          if (polling == 0) /* polling disabled, cannot register */
>                  return 0;
> @@ -415,30 +449,27 @@
>          if (ifp->if_flags & IFF_POLLING) /* already polling */
>                  return 0;
>  
> -        s = splhigh();
> -        if (poll_handlers >= POLL_LIST_LEN) {
> -                /*
> -                 * List full, cannot register more entries.
> -                 * This should never happen; if it does, it is probably a
> -                 * broken driver trying to register multiple times. Checking
> -                 * this at runtime is expensive, and won't solve the problem
> -                 * anyways, so just report a few times and then give up.
> -                 */
> -                static int verbose = 10 ;
> -                splx(s);
> -                if (verbose >0) {
> -                        printf("poll handlers list full, "
> -                            "maybe a broken driver ?\n");
> -                        verbose--;
> +        mtx_lock(&poll_mtx);
> +        LIST_FOREACH(pr, &poll_list, next)
> +                if (pr->ifp == ifp && pr->handler != NULL) {
> +                        mtx_unlock(&poll_mtx);
> +                        log(LOG_DEBUG, "ether_poll_register: %s: handler"
> +                            " already registered\n", ifp->if_xname);
> +                        return (0);
>                  }
> -                return 0; /* no polling for you */
> +        pr = malloc(sizeof(*pr), M_TEMP, M_NOWAIT);
> +        if (pr == NULL) {
> +                mtx_unlock(&poll_mtx);
> +                log(LOG_DEBUG, "ether_poll_register: malloc() failed\n");
> +                return (0);
>          }
>  
> -        pr[poll_handlers].handler = h;
> -        pr[poll_handlers].ifp = ifp;
> +        pr->handler = h;
> +        pr->ifp = ifp;
> +        LIST_INSERT_HEAD(&poll_list, pr, next);
>          poll_handlers++;
>          ifp->if_flags |= IFF_POLLING;
> -        splx(s);
> +        mtx_unlock(&poll_mtx);
>          if (idlepoll_sleeping)
>                  wakeup(&idlepoll_sleeping);
>          return 1; /* polling enabled in next call */
> @@ -453,29 +484,24 @@
>  int
>  ether_poll_deregister(struct ifnet *ifp)
>  {
> -        int i;
> +        struct pollrec *pr;
>  
> -        mtx_lock(&Giant);
>          if ( !ifp || !(ifp->if_flags & IFF_POLLING) ) {
> -                mtx_unlock(&Giant);
>                  return 0;
>          }
> -        for (i = 0 ; i < poll_handlers ; i++)
> -                if (pr[i].ifp == ifp) /* found it */
> -                        break;
> -        ifp->if_flags &= ~IFF_POLLING; /* found or not... */
> -        if (i == poll_handlers) {
> -                mtx_unlock(&Giant);
> -                printf("ether_poll_deregister: ifp not found!!!\n");
> -                return 0;
> -        }
> -        poll_handlers--;
> -        if (i < poll_handlers) { /* Last entry replaces this one. */
> -                pr[i].handler = pr[poll_handlers].handler;
> -                pr[i].ifp = pr[poll_handlers].ifp;
> -        }
> -        mtx_unlock(&Giant);
> -        return 1;
> +        mtx_lock(&poll_mtx);
> +        LIST_FOREACH(pr, &poll_list, next)
> +                if (pr->ifp == ifp && pr->handler != NULL) {
> +                        pr->handler = NULL;
> +                        poll_handlers--;
> +                        mtx_unlock(&poll_mtx);
> +                        ifp->if_flags &= ~IFF_POLLING;
> +                        return (1);
> +                }
> +        mtx_unlock(&poll_mtx);
> +        log(LOG_DEBUG, "ether_poll_deregister: %s: not found!!!\n",
> +            ifp->if_xname);
> +        return (0);
>  }
>  
>  static void
> @@ -495,10 +521,7 @@
>          for (;;) {
>                  if (poll_in_idle_loop && poll_handlers > 0) {
>                          idlepoll_sleeping = 0;
> -                        mtx_lock(&Giant);
>                          ether_poll(poll_each_burst);
> -                        mtx_unlock(&Giant);
> -                        mtx_assert(&Giant, MA_NOTOWNED);
>                          mtx_lock_spin(&sched_lock);
>                          mi_switch(SW_VOL, NULL);
>                          mtx_unlock_spin(&sched_lock);
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
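
[The pattern the patch applies in ether_poll() and netisr_poll() -- set a
per-record PRF_RUNNING flag, drop the poll mutex around the driver callback,
then reacquire the mutex and clear the flag -- can be sketched in isolation.
The fragment below is only a minimal userland illustration: it uses pthreads
instead of mtx(9), the names poll_pass/dummy_handler/pollrec fields are made
up for the example, and the PRF_PINNED/record-removal handling that the real
patch needs is deliberately omitted (records here are never freed). It shows
why two concurrent pollers never enter the same non-reentrant handler while
the list mutex is not held across the callback.]

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/queue.h>

    #define PRF_RUNNING 0x1   /* record is currently inside its handler */

    struct pollrec {
            void            (*handler)(void *arg, int count);
            void            *arg;
            unsigned        flags;
            LIST_ENTRY(pollrec) next;
    };

    static LIST_HEAD(, pollrec) poll_list = LIST_HEAD_INITIALIZER(poll_list);
    static pthread_mutex_t poll_mtx = PTHREAD_MUTEX_INITIALIZER;

    /*
     * One polling pass; several threads (think "idle poll" and "ISR poll")
     * may run this concurrently.  The list mutex is never held across the
     * callback, and PRF_RUNNING keeps a second pass out of a busy record.
     * Records are never removed in this sketch, so plain LIST_FOREACH is
     * safe even though the mutex is dropped mid-iteration.
     */
    static void
    poll_pass(int count)
    {
            struct pollrec *pr;

            pthread_mutex_lock(&poll_mtx);
            LIST_FOREACH(pr, &poll_list, next) {
                    if (pr->handler == NULL || (pr->flags & PRF_RUNNING))
                            continue;       /* dead or already being polled */
                    pr->flags |= PRF_RUNNING;
                    pthread_mutex_unlock(&poll_mtx);
                    pr->handler(pr->arg, count);    /* may block, take other locks */
                    pthread_mutex_lock(&poll_mtx);
                    pr->flags &= ~PRF_RUNNING;
            }
            pthread_mutex_unlock(&poll_mtx);
    }

    static void
    dummy_handler(void *arg, int count)
    {
            printf("polled %s, burst %d\n", (const char *)arg, count);
    }

    int
    main(void)
    {
            static struct pollrec pr0 = { dummy_handler, "em0", 0 };

            LIST_INSERT_HEAD(&poll_list, &pr0, next);
            poll_pass(5);
            return (0);
    }

[In the kernel patch the same dance happens under poll_mtx, and the extra
PRF_PINNED flag is there so that the next-element pointer saved by
LIST_FOREACH_SAFE is not freed by the other poller while the mutex is
dropped.]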