From owner-freebsd-hackers@FreeBSD.ORG  Wed Apr  1 21:34:58 2015
From: Jim Harris
To: Konstantin Belousov
Cc: freebsd-hackers@freebsd.org, Tobias Oberstein, Michael Fuckner, Alan Somers
Date: Wed, 1 Apr 2015 14:34:57 -0700
Subject: Re: NVMe performance 4x slower than expected
In-Reply-To: <20150401212303.GB2379@kib.kiev.ua>
References: <551BC57D.5070101@gmail.com> <551C5A82.2090306@gmail.com>
 <20150401212303.GB2379@kib.kiev.ua>

On Wed, Apr 1, 2015 at 2:23 PM, Konstantin Belousov wrote:

> On Wed, Apr 01, 2015 at 10:52:18PM +0200, Tobias Oberstein wrote:
> > > > FreeBSD 11-CURRENT with patches (DMAR and ZFS patches; otherwise the
> > > > box doesn't boot at all, because of the 3 TB of RAM and the amount
> > > > of peripherals).
> > >
> > > Do you still have WITNESS and INVARIANTS turned on in your kernel
> > > config?  They're turned on by default for CURRENT, but they do have
> > > some performance impact.  To turn them off, just build a
> > > GENERIC-NODEBUG kernel.
> >
> > WITNESS is off, INVARIANTS is still on.
>
> INVARIANTS are costly.
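(Side note, since a custom config is in play here anyway: INVARIANTS can be
stripped the same way GENERIC-NODEBUG does it.  A sketch only - the ident
below is made up, and the exact nooptions list should be double-checked
against sys/amd64/conf/GENERIC-NODEBUG in your tree:

include         GENERIC
ident           CRUNCHER-NODEBUG

nooptions       INVARIANTS
nooptions       INVARIANT_SUPPORT
nooptions       WITNESS
nooptions       WITNESS_SKIPSPIN

Then build with "make buildkernel KERNCONF=CRUNCHER-NODEBUG".)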
> > Here is the complete config:
> > https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_kernel_conf.md
> >
> > This is the aggregated patch (the work was done by Konstantin - thanks
> > again btw!):
> > https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_patch.md
> >
> > > Could you also post full dmesg output as well as vmstat -i?
> >
> > dmesg:
> > https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_dmesg.md
> >
> > vmstat:
> > https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd_vmstat.md
> >
> > ===
> >
> > Here are the results from FIO under FreeBSD:
> > https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/freebsd.md
> >
> > Here are the results using the _same_ FIO control file under Linux:
> > https://github.com/oberstet/scratchbox/blob/master/freebsd/cruncher/results/linux.md
>
> Is this the vmstat output from after the test?
> It is somewhat odd that nvme does not use MSI(X).

Yes - this is exactly the problem.  nvme does use MSI-X if it can allocate
the vectors (one per core).  With 48 cores, I suspect we are quickly
running out of vectors, so NVMe is reverting to INTx.

Could you actually send vmstat -ia (I left off the 'a' previously), just
so we can see all of the allocated interrupt vectors?

As an experiment, could you try disabling hyperthreading?  This will
reduce the number of cores and should let MSI-X vectors be allocated for
at least the first couple of NVMe controllers.  Then please re-run your
performance test on one of those controllers.

sys/x86/x86/local_apic.c defines APIC_NUM_IOINTS as 191 - it looks like
this is the actual limit for MSI-X vectors, even though NUM_MSI_INTS is
set to 512.
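For those following along, the "reverting to INTx" decision is made at
attach time.  Below is a minimal sketch of that fallback using the stock
pci(9) and bus resource KPIs - this is not the actual nvme(4) code; the
softc layout, function name, and one-vector-per-core policy are made up
for illustration:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bus.h>
#include <sys/rman.h>
#include <sys/smp.h>
#include <machine/bus.h>
#include <machine/resource.h>
#include <dev/pci/pcivar.h>

struct example_softc {
	struct resource	*intx_res;	/* legacy INTx IRQ, if we fell back */
	int		 num_vectors;	/* MSI-X vectors allocated, or 0 */
};

/*
 * Try for one MSI-X vector per core (plus one for the admin queue);
 * if the system cannot deliver that many, give them all back and
 * revert to the shared legacy INTx line.
 */
static int
example_setup_intr(device_t dev, struct example_softc *sc)
{
	int count, rid;

	count = mp_ncpus + 1;
	if (pci_msix_count(dev) >= count &&
	    pci_alloc_msix(dev, &count) == 0 && count == mp_ncpus + 1) {
		sc->num_vectors = count;
		return (0);
	}

	/* Not enough vectors available: release any partial allocation. */
	pci_release_msi(dev);
	sc->num_vectors = 0;

	/* rid 0 selects the legacy INTx interrupt. */
	rid = 0;
	sc->intx_res = bus_alloc_resource_any(dev, SYS_RES_IRQ, &rid,
	    RF_ACTIVE | RF_SHAREABLE);
	return (sc->intx_res != NULL ? 0 : ENXIO);
}

With several controllers each asking for ~48 vectors out of a pool capped
at APIC_NUM_IOINTS, the later controllers inevitably land in the INTx
branch, and all of their I/O completions then funnel through one shared
interrupt line.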
> I have had the following patch for a long time; it increased pps in
> iperf and similar tests when DMAR is enabled.  In your case it could
> reduce the rate of DMAR interrupts.
>
> diff --git a/sys/x86/iommu/intel_ctx.c b/sys/x86/iommu/intel_ctx.c
> index a18adcf..b23a4c1 100644
> --- a/sys/x86/iommu/intel_ctx.c
> +++ b/sys/x86/iommu/intel_ctx.c
> @@ -586,6 +586,18 @@ dmar_ctx_unload_entry(struct dmar_map_entry *entry, bool free)
>  	}
>  }
>
> +static struct dmar_qi_genseq *
> +dmar_ctx_unload_gseq(struct dmar_ctx *ctx, struct dmar_map_entry *entry,
> +    struct dmar_qi_genseq *gseq)
> +{
> +
> +	if (TAILQ_NEXT(entry, dmamap_link) != NULL)
> +		return (NULL);
> +	if (ctx->batch_no++ % dmar_batch_coalesce != 0)
> +		return (NULL);
> +	return (gseq);
> +}
> +
>  void
>  dmar_ctx_unload(struct dmar_ctx *ctx, struct dmar_map_entries_tailq *entries,
>      bool cansleep)
> @@ -619,8 +631,7 @@ dmar_ctx_unload(struct dmar_ctx *ctx, struct dmar_map_entries_tailq *entries,
>  		entry->gseq.gen = 0;
>  		entry->gseq.seq = 0;
>  		dmar_qi_invalidate_locked(ctx, entry->start, entry->end -
> -		    entry->start, TAILQ_NEXT(entry, dmamap_link) == NULL ?
> -		    &gseq : NULL);
> +		    entry->start, dmar_ctx_unload_gseq(ctx, entry, &gseq));
>  	}
>  	TAILQ_FOREACH_SAFE(entry, entries, dmamap_link, entry1) {
>  		entry->gseq = gseq;
> diff --git a/sys/x86/iommu/intel_dmar.h b/sys/x86/iommu/intel_dmar.h
> index 2865ab5..6e0ab7f 100644
> --- a/sys/x86/iommu/intel_dmar.h
> +++ b/sys/x86/iommu/intel_dmar.h
> @@ -93,6 +93,7 @@ struct dmar_ctx {
>  	u_int entries_cnt;
>  	u_long loads;
>  	u_long unloads;
> +	u_int batch_no;
>  	struct dmar_gas_entries_tree rb_root;
>  	struct dmar_map_entries_tailq unload_entries; /* Entries to unload */
>  	struct dmar_map_entry *first_place, *last_place;
> @@ -339,6 +340,7 @@ extern dmar_haddr_t dmar_high;
>  extern int haw;
>  extern int dmar_tbl_pagecnt;
>  extern int dmar_match_verbose;
> +extern int dmar_batch_coalesce;
>  extern int dmar_check_free;
>
>  static inline uint32_t
> diff --git a/sys/x86/iommu/intel_drv.c b/sys/x86/iommu/intel_drv.c
> index c239579..e7dc3f9 100644
> --- a/sys/x86/iommu/intel_drv.c
> +++ b/sys/x86/iommu/intel_drv.c
> @@ -153,7 +153,7 @@ dmar_count_iter(ACPI_DMAR_HEADER *dmarh, void *arg)
>  	return (1);
>  }
>
> -static int dmar_enable = 0;
> +static int dmar_enable = 1;
>  static void
>  dmar_identify(driver_t *driver, device_t parent)
>  {
> diff --git a/sys/x86/iommu/intel_utils.c b/sys/x86/iommu/intel_utils.c
> index f696f9d..d3c3267 100644
> --- a/sys/x86/iommu/intel_utils.c
> +++ b/sys/x86/iommu/intel_utils.c
> @@ -624,6 +624,7 @@ dmar_barrier_exit(struct dmar_unit *dmar, u_int barrier_id)
>  }
>
>  int dmar_match_verbose;
> +int dmar_batch_coalesce = 100;
>
>  static SYSCTL_NODE(_hw, OID_AUTO, dmar, CTLFLAG_RD, NULL, "");
>  SYSCTL_INT(_hw_dmar, OID_AUTO, tbl_pagecnt, CTLFLAG_RD,
> @@ -632,6 +633,9 @@ SYSCTL_INT(_hw_dmar, OID_AUTO, tbl_pagecnt, CTLFLAG_RD,
>  SYSCTL_INT(_hw_dmar, OID_AUTO, match_verbose, CTLFLAG_RWTUN,
>      &dmar_match_verbose, 0,
>      "Verbose matching of the PCI devices to DMAR paths");
> +SYSCTL_INT(_hw_dmar, OID_AUTO, batch_coalesce, CTLFLAG_RW | CTLFLAG_TUN,
> +    &dmar_batch_coalesce, 0,
> +    "Number of qi batches between interrupt");
>  #ifdef INVARIANTS
>  int dmar_check_free;
>  SYSCTL_INT(_hw_dmar, OID_AUTO, check_free, CTLFLAG_RWTUN,
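To spell out what the first hunk does: before the patch,
dmar_qi_invalidate_locked() was handed &gseq - and hence a completion
interrupt was requested - for the last entry of every unload batch.  With
the patch it only does so on every dmar_batch_coalesce'th batch, i.e.
roughly 1/100th of the invalidation-completion interrupts at the default
setting.  Note the patch also flips the dmar_enable default to 1.  Since
the sysctl is created with CTLFLAG_RW | CTLFLAG_TUN, it can be set from
loader.conf or adjusted on a running system; the commands below are
illustrative:

# sysctl hw.dmar.batch_coalesce
hw.dmar.batch_coalesce: 100
# sysctl hw.dmar.batch_coalesce=200
hw.dmar.batch_coalesce: 100 -> 200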