From owner-dev-commits-src-all@freebsd.org Sun Oct 3 05:04:27 2021 Return-Path: Delivered-To: dev-commits-src-all@mailman.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.nyi.freebsd.org (Postfix) with ESMTP id 97BF56B0916; Sun, 3 Oct 2021 05:04:27 +0000 (UTC) (envelope-from git@FreeBSD.org) Received: from mxrelay.nyi.freebsd.org (mxrelay.nyi.freebsd.org [IPv6:2610:1c1:1:606c::19:3]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "mxrelay.nyi.freebsd.org", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4HMWwq3ZMNz4QtJ; Sun, 3 Oct 2021 05:04:27 +0000 (UTC) (envelope-from git@FreeBSD.org) Received: from gitrepo.freebsd.org (gitrepo.freebsd.org [IPv6:2610:1c1:1:6068::e6a:5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mxrelay.nyi.freebsd.org (Postfix) with ESMTPS id 2B50220700; Sun, 3 Oct 2021 05:04:27 +0000 (UTC) (envelope-from git@FreeBSD.org) Received: from gitrepo.freebsd.org ([127.0.1.44]) by gitrepo.freebsd.org (8.16.1/8.16.1) with ESMTP id 19354RJM095202; Sun, 3 Oct 2021 05:04:27 GMT (envelope-from git@gitrepo.freebsd.org) Received: (from git@localhost) by gitrepo.freebsd.org (8.16.1/8.16.1/Submit) id 19354RLa095201; Sun, 3 Oct 2021 05:04:27 GMT (envelope-from git) Date: Sun, 3 Oct 2021 05:04:27 GMT Message-Id: <202110030504.19354RLa095201@gitrepo.freebsd.org> To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-branches@FreeBSD.org From: Kevin Bowling Subject: git: a5a91bd2a09f - stable/12 - iflib: ensure that tx interrupts enabled and cleanups MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Git-Committer: kbowling X-Git-Repository: src X-Git-Refname: refs/heads/stable/12 X-Git-Reftype: branch X-Git-Commit: a5a91bd2a09ffe748b6dd7d5f996fe725c22c774 Auto-Submitted: auto-generated X-BeenThere: dev-commits-src-all@freebsd.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Commit messages for all branches of the src repository List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 03 Oct 2021 05:04:27 -0000 The branch stable/12 has been updated by kbowling (ports committer): URL: https://cgit.FreeBSD.org/src/commit/?id=a5a91bd2a09ffe748b6dd7d5f996fe725c22c774 commit a5a91bd2a09ffe748b6dd7d5f996fe725c22c774 Author: Matt Macy AuthorDate: 2020-12-19 01:08:33 +0000 Commit: Kevin Bowling CommitDate: 2021-10-03 02:03:11 +0000 iflib: ensure that tx interrupts enabled and cleanups Doing a 'dd' over iscsi will reliably cause stalls. Tx cleaning _should_ reliably happen as data is sent. However, currently if the transmit queue fills it will wait until the iflib timer (hz/2) runs. This change causes the the tx taskq thread to be run if there are completed descriptors. While here: - make timer interrupt delay a sysctl - simplify txd_db_check handling - comment on INTR types Background on the change: Initially doorbell updates were minimized by only writing to the register on every fourth packet. If txq_drain would return without writing to the doorbell it scheduled a callout on the next tick to do the doorbell write to ensure that the write otherwise happened "soon". At that time a sysctl was added for users to avoid the potential added latency by simply writing to the doorbell register on every packet. This worked perfectly well for e1000 and ixgbe ... and appeared to work well on ixl. However, as it turned out there was a race to this approach that would lockup the ixl MAC. It was possible for a lower producer index to be written after a higher one. On e1000 and ixgbe this was harmless - on ixl it was fatal. My initial response was to add a lock around doorbell writes - fixing the problem but adding an unacceptable amount of lock contention. The next iteration was to use transmit interrupts to drive delayed doorbell writes. If there were no packets in the queue all doorbell writes would be immediate as the queue started to fill up we could delay doorbell writes further and further. At the start of drain if we've cleaned any packets we know we've moved the state machine along and we write the doorbell (an obvious missing optimization was to skip that doorbell write if db_pending is zero). This change required that tx interrupts be scheduled periodically as opposed to just when the hardware txq was full. However, that just leads to our next problem. Initially dedicated msix vectors were used for both tx and rx. However, it was often possible to use up all available vectors before we set up all the queues we wanted. By having rx and tx share a vector for a given queue we could halve the number of vectors used by a given configuration. The problem here is that with this change only e1000 passed the necessary value to have the fast interrupt drive tx when appropriate. Reported by: mav@ Tested by: mav@ Reviewed by: gallatin@ MFC after: 1 month Sponsored by: iXsystems Differential Revision: https://reviews.freebsd.org/D27683 (cherry picked from commit 81be655266fac2b333e25f386f32c9bcd17f523d) --- sys/dev/bnxt/if_bnxt.c | 2 +- sys/dev/ice/if_ice_iflib.c | 2 +- sys/dev/ixgbe/if_ix.c | 2 +- sys/dev/ixgbe/if_ixv.c | 2 +- sys/dev/ixl/if_iavf.c | 2 +- sys/dev/ixl/if_ixl.c | 2 +- sys/dev/vmware/vmxnet3/if_vmx.c | 2 +- sys/net/iflib.c | 94 +++++++++++++++++++++++++++-------------- sys/net/iflib.h | 17 ++++++++ 9 files changed, 86 insertions(+), 39 deletions(-) diff --git a/sys/dev/bnxt/if_bnxt.c b/sys/dev/bnxt/if_bnxt.c index 9408da32d78a..3a0fdb85e848 100644 --- a/sys/dev/bnxt/if_bnxt.c +++ b/sys/dev/bnxt/if_bnxt.c @@ -1524,7 +1524,7 @@ bnxt_msix_intr_assign(if_ctx_t ctx, int msix) for (i=0; iscctx->isc_nrxqsets; i++) { snprintf(irq_name, sizeof(irq_name), "rxq%d", i); rc = iflib_irq_alloc_generic(ctx, &softc->rx_cp_rings[i].irq, - softc->rx_cp_rings[i].ring.id + 1, IFLIB_INTR_RX, + softc->rx_cp_rings[i].ring.id + 1, IFLIB_INTR_RXTX, bnxt_handle_rx_cp, &softc->rx_cp_rings[i], i, irq_name); if (rc) { device_printf(iflib_get_dev(ctx), diff --git a/sys/dev/ice/if_ice_iflib.c b/sys/dev/ice/if_ice_iflib.c index 6fdb738b192b..25eaae8ed26d 100644 --- a/sys/dev/ice/if_ice_iflib.c +++ b/sys/dev/ice/if_ice_iflib.c @@ -1493,7 +1493,7 @@ ice_if_msix_intr_assign(if_ctx_t ctx, int msix) snprintf(irq_name, sizeof(irq_name), "rxq%d", i); err = iflib_irq_alloc_generic(ctx, &sc->irqvs[vector].irq, rid, - IFLIB_INTR_RX, ice_msix_que, + IFLIB_INTR_RXTX, ice_msix_que, rxq, rxq->me, irq_name); if (err) { device_printf(sc->dev, diff --git a/sys/dev/ixgbe/if_ix.c b/sys/dev/ixgbe/if_ix.c index 0ce1a5441a62..581944fa6a73 100644 --- a/sys/dev/ixgbe/if_ix.c +++ b/sys/dev/ixgbe/if_ix.c @@ -2038,7 +2038,7 @@ ixgbe_if_msix_intr_assign(if_ctx_t ctx, int msix) snprintf(buf, sizeof(buf), "rxq%d", i); error = iflib_irq_alloc_generic(ctx, &rx_que->que_irq, rid, - IFLIB_INTR_RX, ixgbe_msix_que, rx_que, rx_que->rxr.me, buf); + IFLIB_INTR_RXTX, ixgbe_msix_que, rx_que, rx_que->rxr.me, buf); if (error) { device_printf(iflib_get_dev(ctx), diff --git a/sys/dev/ixgbe/if_ixv.c b/sys/dev/ixgbe/if_ixv.c index 114a41ea8256..805701073e0a 100644 --- a/sys/dev/ixgbe/if_ixv.c +++ b/sys/dev/ixgbe/if_ixv.c @@ -1026,7 +1026,7 @@ ixv_if_msix_intr_assign(if_ctx_t ctx, int msix) snprintf(buf, sizeof(buf), "rxq%d", i); error = iflib_irq_alloc_generic(ctx, &rx_que->que_irq, rid, - IFLIB_INTR_RX, ixv_msix_que, rx_que, rx_que->rxr.me, buf); + IFLIB_INTR_RXTX, ixv_msix_que, rx_que, rx_que->rxr.me, buf); if (error) { device_printf(iflib_get_dev(ctx), diff --git a/sys/dev/ixl/if_iavf.c b/sys/dev/ixl/if_iavf.c index 0cb6f54a8c92..28b76eced25a 100644 --- a/sys/dev/ixl/if_iavf.c +++ b/sys/dev/ixl/if_iavf.c @@ -898,7 +898,7 @@ iavf_if_msix_intr_assign(if_ctx_t ctx, int msix) snprintf(buf, sizeof(buf), "rxq%d", i); err = iflib_irq_alloc_generic(ctx, &rx_que->que_irq, rid, - IFLIB_INTR_RX, iavf_msix_que, rx_que, rx_que->rxr.me, buf); + IFLIB_INTR_RXTX, iavf_msix_que, rx_que, rx_que->rxr.me, buf); /* XXX: Does the driver work as expected if there are fewer num_rx_queues than * what's expected in the iflib context? */ if (err) { diff --git a/sys/dev/ixl/if_ixl.c b/sys/dev/ixl/if_ixl.c index 52a6fc480d24..52c9dd293e6b 100644 --- a/sys/dev/ixl/if_ixl.c +++ b/sys/dev/ixl/if_ixl.c @@ -1071,7 +1071,7 @@ ixl_if_msix_intr_assign(if_ctx_t ctx, int msix) snprintf(buf, sizeof(buf), "rxq%d", i); err = iflib_irq_alloc_generic(ctx, &rx_que->que_irq, rid, - IFLIB_INTR_RX, ixl_msix_que, rx_que, rx_que->rxr.me, buf); + IFLIB_INTR_RXTX, ixl_msix_que, rx_que, rx_que->rxr.me, buf); /* XXX: Does the driver work as expected if there are fewer num_rx_queues than * what's expected in the iflib context? */ if (err) { diff --git a/sys/dev/vmware/vmxnet3/if_vmx.c b/sys/dev/vmware/vmxnet3/if_vmx.c index a74d09988a2a..b3cddcbc879f 100644 --- a/sys/dev/vmware/vmxnet3/if_vmx.c +++ b/sys/dev/vmware/vmxnet3/if_vmx.c @@ -482,7 +482,7 @@ vmxnet3_msix_intr_assign(if_ctx_t ctx, int msix) rxq = &sc->vmx_rxq[i]; error = iflib_irq_alloc_generic(ctx, &rxq->vxrxq_irq, i + 1, - IFLIB_INTR_RX, vmxnet3_rxq_intr, rxq, i, irq_name); + IFLIB_INTR_RXTX, vmxnet3_rxq_intr, rxq, i, irq_name); if (error) { device_printf(iflib_get_dev(ctx), "Failed to register rxq %d interrupt handler\n", i); diff --git a/sys/net/iflib.c b/sys/net/iflib.c index 2d53b1ef8229..3a86671c7443 100644 --- a/sys/net/iflib.c +++ b/sys/net/iflib.c @@ -591,6 +591,10 @@ SYSCTL_INT(_net_iflib, OID_AUTO, min_tx_latency, CTLFLAG_RW, static int iflib_no_tx_batch = 0; SYSCTL_INT(_net_iflib, OID_AUTO, no_tx_batch, CTLFLAG_RW, &iflib_no_tx_batch, 0, "minimize transmit latency at the possible expense of throughput"); +static int iflib_timer_default = 1000; +SYSCTL_INT(_net_iflib, OID_AUTO, timer_default, CTLFLAG_RW, + &iflib_timer_default, 0, "number of ticks between iflib_timer calls"); + #if IFLIB_DEBUG_COUNTERS @@ -2462,7 +2466,7 @@ iflib_timer(void *arg) ** can be done without the lock because its RO ** and the HUNG state will be static if set. */ - if (this_tick - txq->ift_last_timer_tick >= hz / 2) { + if (this_tick - txq->ift_last_timer_tick >= iflib_timer_default) { txq->ift_last_timer_tick = this_tick; IFDI_TIMER(ctx, txq->ift_id); if ((txq->ift_qstatus == IFLIB_QUEUE_HUNG) && @@ -2472,7 +2476,8 @@ iflib_timer(void *arg) if (txq->ift_qstatus != IFLIB_QUEUE_IDLE && ifmp_ring_is_stalled(txq->ift_br)) { - KASSERT(ctx->ifc_link_state == LINK_STATE_UP, ("queue can't be marked as hung if interface is down")); + KASSERT(ctx->ifc_link_state == LINK_STATE_UP, + ("queue can't be marked as hung if interface is down")); txq->ift_qstatus = IFLIB_QUEUE_HUNG; } txq->ift_cleaned_prev = txq->ift_cleaned; @@ -2483,7 +2488,8 @@ iflib_timer(void *arg) sctx->isc_pause_frames = 0; if (if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING) - callout_reset_on(&txq->ift_timer, hz / 2, iflib_timer, txq, txq->ift_timer.c_cpu); + callout_reset_on(&txq->ift_timer, iflib_timer_default, iflib_timer, + txq, txq->ift_timer.c_cpu); return; hung: @@ -2601,7 +2607,7 @@ done: IFDI_INTR_ENABLE(ctx); txq = ctx->ifc_txqs; for (i = 0; i < sctx->isc_ntxqsets; i++, txq++) - callout_reset_on(&txq->ift_timer, hz/2, iflib_timer, txq, + callout_reset_on(&txq->ift_timer, iflib_timer_default, iflib_timer, txq, txq->ift_timer.c_cpu); /* Re-enable txsync/rxsync. */ @@ -3123,26 +3129,37 @@ txq_max_rs_deferred(iflib_txq_t txq) /* XXX we should be setting this to something other than zero */ #define RECLAIM_THRESH(ctx) ((ctx)->ifc_sctx->isc_tx_reclaim_thresh) -#define MAX_TX_DESC(ctx) max((ctx)->ifc_softc_ctx.isc_tx_tso_segments_max, \ +#define MAX_TX_DESC(ctx) MAX((ctx)->ifc_softc_ctx.isc_tx_tso_segments_max, \ (ctx)->ifc_softc_ctx.isc_tx_nsegments) static inline bool -iflib_txd_db_check(if_ctx_t ctx, iflib_txq_t txq, int ring, qidx_t in_use) +iflib_txd_db_check(iflib_txq_t txq, int ring) { + if_ctx_t ctx = txq->ift_ctx; qidx_t dbval, max; - bool rang; - rang = false; - max = TXQ_MAX_DB_DEFERRED(txq, in_use); - if (ring || txq->ift_db_pending >= max) { + max = TXQ_MAX_DB_DEFERRED(txq, txq->ift_in_use); + + /* force || threshold exceeded || at the edge of the ring */ + if (ring || (txq->ift_db_pending >= max) || (TXQ_AVAIL(txq) <= MAX_TX_DESC(ctx) + 2)) { + + /* + * 'npending' is used if the card's doorbell is in terms of the number of descriptors + * pending flush (BRCM). 'pidx' is used in cases where the card's doorbeel uses the + * producer index explicitly (INTC). + */ dbval = txq->ift_npending ? txq->ift_npending : txq->ift_pidx; bus_dmamap_sync(txq->ift_ifdi->idi_tag, txq->ift_ifdi->idi_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); ctx->isc_txd_flush(ctx->ifc_softc, txq->ift_id, dbval); + + /* + * Absent bugs there are zero packets pending so reset pending counts to zero. + */ txq->ift_db_pending = txq->ift_npending = 0; - rang = true; + return (true); } - return (rang); + return (false); } #ifdef PKT_DEBUG @@ -3603,6 +3620,7 @@ defrag: MPASS(pi.ipi_new_pidx != pidx); MPASS(ndesc > 0); txq->ift_in_use += ndesc; + txq->ift_db_pending += ndesc; /* * We update the last software descriptor again here because there may @@ -3770,8 +3788,8 @@ iflib_txq_drain(struct ifmp_ring *r, uint32_t cidx, uint32_t pidx) if_ctx_t ctx = txq->ift_ctx; if_t ifp = ctx->ifc_ifp; struct mbuf *m, **mp; - int avail, bytes_sent, consumed, count, err, i, in_use_prev; - int mcast_sent, pkt_sent, reclaimed, txq_avail; + int avail, bytes_sent, skipped, count, err, i; + int mcast_sent, pkt_sent, reclaimed; bool do_prefetch, rang, ring; if (__predict_false(!(if_getdrvflags(ifp) & IFF_DRV_RUNNING) || @@ -3780,9 +3798,13 @@ iflib_txq_drain(struct ifmp_ring *r, uint32_t cidx, uint32_t pidx) return (0); } reclaimed = iflib_completed_tx_reclaim(txq, RECLAIM_THRESH(ctx)); - rang = iflib_txd_db_check(ctx, txq, reclaimed, txq->ift_in_use); + rang = iflib_txd_db_check(txq, reclaimed && txq->ift_db_pending); avail = IDXDIFF(pidx, cidx, r->size); + if (__predict_false(ctx->ifc_flags & IFC_QFLUSH)) { + /* + * The driver is unloading so we need to free all pending packets. + */ DBG_COUNTER_INC(txq_drain_flushing); for (i = 0; i < avail; i++) { if (__predict_true(r->items[(cidx + i) & (r->size-1)] != (void *)txq)) @@ -3800,9 +3822,13 @@ iflib_txq_drain(struct ifmp_ring *r, uint32_t cidx, uint32_t pidx) DBG_COUNTER_INC(txq_drain_oactive); return (0); } + + /* + * If we've reclaimed any packets this queue cannot be hung. + */ if (reclaimed) txq->ift_qstatus = IFLIB_QUEUE_IDLE; - consumed = mcast_sent = bytes_sent = pkt_sent = 0; + skipped = mcast_sent = bytes_sent = pkt_sent = 0; count = MIN(avail, TX_BATCH_SIZE); #ifdef INVARIANTS if (iflib_verbose_debug) @@ -3810,54 +3836,57 @@ iflib_txq_drain(struct ifmp_ring *r, uint32_t cidx, uint32_t pidx) avail, ctx->ifc_flags, TXQ_AVAIL(txq)); #endif do_prefetch = (ctx->ifc_flags & IFC_PREFETCH); - txq_avail = TXQ_AVAIL(txq); err = 0; - for (i = 0; i < count && txq_avail > MAX_TX_DESC(ctx) + 2; i++) { + for (i = 0; i < count && TXQ_AVAIL(txq) >= MAX_TX_DESC(ctx) + 2; i++) { int rem = do_prefetch ? count - i : 0; mp = _ring_peek_one(r, cidx, i, rem); MPASS(mp != NULL && *mp != NULL); + + /* + * Completion interrupts will use the address of the txq + * as a sentinel to enqueue _something_ in order to acquire + * the lock on the mp_ring (there's no direct lock call). + * We obviously whave to check for these sentinel cases + * and skip them. + */ if (__predict_false(*mp == (struct mbuf *)txq)) { - consumed++; + skipped++; continue; } - in_use_prev = txq->ift_in_use; err = iflib_encap(txq, mp); if (__predict_false(err)) { /* no room - bail out */ if (err == ENOBUFS) break; - consumed++; + skipped++; /* we can't send this packet - skip it */ continue; } - consumed++; pkt_sent++; m = *mp; DBG_COUNTER_INC(tx_sent); bytes_sent += m->m_pkthdr.len; mcast_sent += !!(m->m_flags & M_MCAST); - txq_avail = TXQ_AVAIL(txq); - txq->ift_db_pending += (txq->ift_in_use - in_use_prev); - ETHER_BPF_MTAP(ifp, m); if (__predict_false(!(ifp->if_drv_flags & IFF_DRV_RUNNING))) break; - rang = iflib_txd_db_check(ctx, txq, false, in_use_prev); + ETHER_BPF_MTAP(ifp, m); + rang = iflib_txd_db_check(txq, false); } /* deliberate use of bitwise or to avoid gratuitous short-circuit */ - ring = rang ? false : (iflib_min_tx_latency | err) || (TXQ_AVAIL(txq) < MAX_TX_DESC(ctx)); - iflib_txd_db_check(ctx, txq, ring, txq->ift_in_use); + ring = rang ? false : (iflib_min_tx_latency | err); + iflib_txd_db_check(txq, ring); if_inc_counter(ifp, IFCOUNTER_OBYTES, bytes_sent); if_inc_counter(ifp, IFCOUNTER_OPACKETS, pkt_sent); if (mcast_sent) if_inc_counter(ifp, IFCOUNTER_OMCASTS, mcast_sent); #ifdef INVARIANTS if (iflib_verbose_debug) - printf("consumed=%d\n", consumed); + printf("consumed=%d\n", skipped + pkt_sent); #endif - return (consumed); + return (skipped + pkt_sent); } static uint32_t @@ -4030,7 +4059,7 @@ _task_fn_admin(void *context) } IFDI_UPDATE_ADMIN_STATUS(ctx); for (txq = ctx->ifc_txqs, i = 0; i < sctx->isc_ntxqsets; i++, txq++) { - callout_reset_on(&txq->ift_timer, hz / 2, iflib_timer, txq, + callout_reset_on(&txq->ift_timer, iflib_timer_default, iflib_timer, txq, txq->ift_timer.c_cpu); } IFDI_LINK_INTR_ENABLE(ctx); @@ -5660,6 +5689,7 @@ iflib_device_iov_add_vf(device_t dev, uint16_t vfnum, const nvlist_t *params) static int iflib_module_init(void) { + iflib_timer_default = hz / 2; return (0); } @@ -7096,7 +7126,7 @@ iflib_netdump_transmit(if_t ifp, struct mbuf *m) txq = &ctx->ifc_txqs[0]; error = iflib_encap(txq, &m); if (error == 0) - (void)iflib_txd_db_check(ctx, txq, true, txq->ift_in_use); + (void)iflib_txd_db_check(txq, true); return (error); } diff --git a/sys/net/iflib.h b/sys/net/iflib.h index 662da8748c54..ffbae937808c 100644 --- a/sys/net/iflib.h +++ b/sys/net/iflib.h @@ -282,10 +282,27 @@ typedef struct iflib_dma_info { #define IFLIB_MAGIC 0xCAFEF00D typedef enum { + /* Interrupt or softirq handles only receive */ IFLIB_INTR_RX, + + /* Interrupt or softirq handles only transmit */ IFLIB_INTR_TX, + + /* + * Interrupt will check for both pending receive + * and available tx credits and dispatch a task + * for one or both depending on the disposition + * of the respective queues. + */ IFLIB_INTR_RXTX, + + /* + * Other interrupt - typically link status and + * or error conditions. + */ IFLIB_INTR_ADMIN, + + /* Softirq (task) for iov handling */ IFLIB_INTR_IOV, } iflib_intr_type_t;