From owner-freebsd-current@FreeBSD.ORG Sat Nov 6 02:33:52 2010 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id F19FE106564A for ; Sat, 6 Nov 2010 02:33:52 +0000 (UTC) (envelope-from pyunyh@gmail.com) Received: from mail-iw0-f182.google.com (mail-iw0-f182.google.com [209.85.214.182]) by mx1.freebsd.org (Postfix) with ESMTP id ACBA18FC08 for ; Sat, 6 Nov 2010 02:33:52 +0000 (UTC) Received: by iwn39 with SMTP id 39so3568587iwn.13 for ; Fri, 05 Nov 2010 19:33:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:received:from:date:to:cc :subject:message-id:reply-to:references:mime-version:content-type :content-disposition:in-reply-to:user-agent; bh=Pwp/j6Gvv2NpW7IzCKexAClFro/8lZWqoorKwn0lk58=; b=OmLfWaiCG/V/eFbrF89t0+iCP929usq3jC7CCaCSB4JKGLJgnOwBU045QEuMym5pfX rUWpaRTsf+tXVPJ/yR96KD4dY5EM9o9tFTiWqg8ACHhUNe5YIQ6aLJYMzoMXUxSzYbO0 6Pxu4nXzxeRFAuwnu7Q8eLeJ4DO6WtHqskMFg= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=from:date:to:cc:subject:message-id:reply-to:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; b=rp6ZkY3uNfw+zOvFUDAURQFBErL05PyG218u+CjzgOEiLCsCD/0afitmFa/TEiUjIP mBTNgYbCIoAK8x1HUdX/uaG6EMUmS7ijAXEgqTDcJhk9+6eb6sToeVO7L8vvcfIag6y9 ra/4L0EKrs2duC5yoQHOKLCdMzklo+owFsGSo= Received: by 10.231.33.137 with SMTP id h9mr2267723ibd.117.1289010831950; Fri, 05 Nov 2010 19:33:51 -0700 (PDT) Received: from pyunyh@gmail.com ([174.35.1.224]) by mx.google.com with ESMTPS id i16sm108705ibl.0.2010.11.05.19.33.49 (version=TLSv1/SSLv3 cipher=RC4-MD5); Fri, 05 Nov 2010 19:33:50 -0700 (PDT) Received: by pyunyh@gmail.com (sSMTP sendmail emulation); Fri, 5 Nov 2010 19:33:45 -0700 From: Pyun YongHyeon Date: Fri, 5 Nov 2010 19:33:45 -0700 To: Rick Macklem Message-ID: <20101106023345.GC22715@michelle.cdnetworks.com> References: <20101105023153.GA20301@michelle.cdnetworks.com> <77175273.185228.1289000696906.JavaMail.root@erie.cs.uoguelph.ca> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="1yeeQ81UyVL57Vl7" Content-Disposition: inline In-Reply-To: <77175273.185228.1289000696906.JavaMail.root@erie.cs.uoguelph.ca> User-Agent: Mutt/1.4.2.3i Cc: freebsd-current@freebsd.org Subject: Re: re(4) driver dropping packets when reading NFS files X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: pyunyh@gmail.com List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 06 Nov 2010 02:33:53 -0000 --1yeeQ81UyVL57Vl7 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Fri, Nov 05, 2010 at 07:44:56PM -0400, Rick Macklem wrote: > > On Thu, Nov 04, 2010 at 09:31:30PM -0400, Rick Macklem wrote: > > > > > > > > If the counter was not wrapped, it seem you lost more than 10% out > > > > of > > > > total RX frames. This is a lot loss and there should be a way to > > > > mitigate it. > > > > > > > I've attached a patch (to the if_re.c in head, not your patched > > > variant) > > > that works a lot better (about 5Mbytes/sec read rate). To get that, > > > I > > > had to disable msi and not clear the RL_IMR register in re_intr(). I > > > suspect that a packet would be received between when the bits in > > > RL_IMR > > > were cleared and when they were set at the end of re_int_task() and > > > those > > > were getting lost. > > > > > > This patch doesn't completely fix the problem. (I added your stats > > > collecting > > > stuff to the if_re.c in head and attached the result, which still > > > shows some lost packets. One > > > thought is clearing the bits in RL_ISR in re_intr() instead of > > > re_int_task(), > > > but then I can't see a good way to pass the old value of the status > > > reg. > > > through to re_int_task()? > > > > > > > Hmm, I still don't understand how the patch mitigates the issue. :-( > > The patch does not disable interrupts in interrupt handler so > > taskqueue runs with interrupt enabled. This may ensure not loosing > > interrupts but it may also generate many interrupts. By chance, did > > you check number of interrupts generated with/without your patch? > > > I agree. I think all it did was create more interrupts so that re_rxeof() > got called more frequently. (see below) > > > The only guess I have at the moment is the writing RL_IMR in > > interrupt handler at the end of taskqueue might be not immediately > > reflected so controller can loose interrupts for the time window. > > Would you try attach patch and let me know it makes any difference? > > > Nope, it didn't make any difference. > > I've added a counter of how many times re_rxeof() gets called, but then > returns without handling any received packets (I think because > RL_RDESC_STAT_OWN is set on the first entry it looks at in the rcv. ring.) > > This count comes out as almost the same as the # of missed frames (see > "rxe did 0:" in the attached stats). > > So, I think what is happenning about 10% of the time is that re_rxeof() > is looking at the ring descriptor before it has been "updated" and then > returns without handling the packet and then it doesn't get called again > because the RL_ISR bit has been cleared. > > When "options DEVICE_POLLING" is specified, it works ok, because it calls > re_rxeof() fairly frequently and it doesn't pay any attention to the RL_ISR > bits. > That's one of possible theory. See below for another theory. > Now, I don't know if this is a hardware flaw on this machine or something > that can be fixed in software? I know diddly about the current driver I highly doubt it could be hardware issue. > setup, but I assume something like this has to happen? > - chip (or PCIe handler) forces the DMA of the descriptor change to RAM > - then posts the interrupt > - and the driver must read the up to date descriptor from RAM > --> I notice that "volatile" isn't used anywhere in the driver. Wouldn't > the descriptor ring memory have to be defined "volatile" somehow? > (I don't know how to do this correctly?) Because driver check different descriptor location in each loop, adding volatile keyword wouldn't help. You can verify that whether that makes any difference though. Another possibility I have in mind is the controller would have reported RL_ISR_RX_OVERRUN but re(4) didn't check that condition. The old data sheet I have does not clearly mention when it is set and what other bits of RL_ISR register is affected from the bit. If RL_ISR_RX_OVERRUN bit is set when there are no more free RX descriptors available to controller and RL_ISR_RX_ERR is not set when RL_ISR_RX_OVERRUN is set, re(4) have ignored that event. Because driver ignored that error interrupt, the next error interrupt would be RL_ISR_FIFO_OFLOW error and this time it would be served in re_rxeof() and will refill RX buffers. However driver would have lost lots of frames received between the time window RL_ISR_RX_OVERRUN error and RL_ISR_FIFO_OFLOW error. If this theory is correct, the attached patch may mitigate the issue. > > So, if you can think of anything that will ensure that re_rxeof() reads > "up to date" descriptor ring entries, please let me know. > It's job of bus_dma(9) and I don't think barrier instructions would be helpful here as I don't see out-of-order execution in RX handler. > Otherwise, I can live with "options DEVICE_POLLING". > Sorry, I can't live with DEVICE_POLLING on re(4). :-( Let's kill driver bug. No one reported this kind of issue so far and I guess most users took it granted for the poor performance because they are using low end consumer grade controller. > Thanks for your help, rick > re0 statistics: > Transmit good frames : 101346 > Receive good frames : 133390 > Tx errors : 0 > Rx errors : 0 > Rx missed frames : 14394 > Rx frame alignment errs : 0 > Tx single collisions : 0 > Tx multiple collisions : 0 > Rx unicast frames : 133378 > Rx broadcast frames : 0 > Rx multicast frames : 12 > Tx aborts : 0 > Tx underruns : 0 > rxe did 0: 14359 --1yeeQ81UyVL57Vl7 Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="re.intr.patch3" Index: sys/pci/if_rlreg.h =================================================================== --- sys/pci/if_rlreg.h (revision 214844) +++ sys/pci/if_rlreg.h (working copy) @@ -873,9 +873,7 @@ int rl_twist_row; int rl_twist_col; int suspended; /* 0 = normal 1 = suspended */ -#ifdef DEVICE_POLLING int rxcycles; -#endif struct task rl_txtask; struct task rl_inttask; Index: sys/dev/re/if_re.c =================================================================== --- sys/dev/re/if_re.c (revision 214844) +++ sys/dev/re/if_re.c (working copy) @@ -1860,7 +1860,7 @@ int i, total_len; struct rl_desc *cur_rx; u_int32_t rxstat, rxvlan; - int maxpkt = 16, rx_npkts = 0; + int rx_npkts = 0; RL_LOCK_ASSERT(sc); @@ -1872,7 +1872,7 @@ sc->rl_ldata.rl_rx_list_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); - for (i = sc->rl_ldata.rl_rx_prodidx; maxpkt > 0; + for (i = sc->rl_ldata.rl_rx_prodidx; sc->rxcycles > 0; i = RL_RX_DESC_NXT(sc, i)) { if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) break; @@ -2036,7 +2036,7 @@ } } } - maxpkt--; + sc->rxcycles--; if (rxvlan & RL_RDESC_VLANCTL_TAG) { m->m_pkthdr.ether_vtag = bswap16((rxvlan & RL_RDESC_VLANCTL_DATA)); @@ -2058,7 +2058,7 @@ if (rx_npktsp != NULL) *rx_npktsp = rx_npkts; - if (maxpkt) + if (sc->rxcycles) return (EAGAIN); return (0); @@ -2258,8 +2258,11 @@ } #endif - if (status & (RL_ISR_RX_OK|RL_ISR_RX_ERR|RL_ISR_FIFO_OFLOW)) + if (status & (RL_ISR_RX_OK | RL_ISR_RX_ERR | RL_ISR_FIFO_OFLOW | + RL_ISR_RX_OVERRUN)) { + sc->rxcycles = sc->rl_ldata.rl_rx_desc_cnt / 2; rval = re_rxeof(sc, NULL); + } /* * Some chips will ignore a second TX request issued --1yeeQ81UyVL57Vl7--