Date:      Mon, 16 Mar 2015 20:56:02 +0200
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Michael Fuckner <michael@fuckner.net>
Cc:        "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>, Ryan Stone <rysto32@gmail.com>, Steven Hartland <killing@multiplay.co.uk>
Subject:   Re: Server with 3TB Crashing at boot
Message-ID:  <20150316185602.GJ2379@kib.kiev.ua>
In-Reply-To: <55072195.40609@fuckner.net>
References:  <20150315193202.GS2379@kib.kiev.ua> <2138577776.537937.1426455964006.JavaMail.open-xchange@ptangptang.store> <20150316091758.GY2379@kib.kiev.ua> <5506ADA4.8020207@fuckner.net> <20150316103140.GA2379@kib.kiev.ua> <5506B23F.20400@fuckner.net> <20150316105301.GB2379@kib.kiev.ua> <5506E8D6.30703@fuckner.net> <20150316154022.GD2379@kib.kiev.ua> <55072195.40609@fuckner.net>

On Mon, Mar 16, 2015 at 07:31:49PM +0100, Michael Fuckner wrote:
> first patch doesn't look good, looks like ahci explodes
The patch I sent changes only the ixgbe code.

This is not ahci; it is nvme that is causing the trouble.  It is a genuine
DMAR bug: the driver requested an allocation with a size greater than the
boundary (0x2000 bytes against a boundary of 0x1000 in the panic below),
and the code failed to split the request.
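
To illustrate the condition that trips the assertion, here is a minimal
userland sketch of a boundary test.  The name fits_within_boundary() is
hypothetical and this is not the kernel's dmar_test_boundary(); it only
assumes that the boundary is zero (no restriction) or a power of two, and
that a mapping must not cross a multiple of it.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical illustration, not the kernel code: return true if the
 * range [start, start + size) does not cross a multiple of boundary.
 * Assumes boundary is 0 (no restriction) or a power of two.
 */
static bool
fits_within_boundary(uint64_t start, uint64_t size, uint64_t boundary)
{
	uint64_t next;

	if (boundary == 0)
		return (true);
	/* First boundary-aligned address strictly above start. */
	next = (start + boundary) & ~(boundary - 1);
	return (start + size <= next);
}
```

With the values from the panic (start 0x131000, size 0x2000, boundary
0x1000) the check fails, since the range would have to cross 0x132000.
When the size exceeds the boundary no placement can satisfy the check, so
such a request can only be served by splitting it.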

> 
> ahcich0: AHCI reset...
> ahcich0: SATA connect timeout time=10000us status=00000000
> ahcich0: AHCI reset: device not found
> ahcich1: AHCI reset...
> ahcich1: SATA connect time=1800us status=00000113
> ahcich1: AHCI reset: device found
> ahcich1: AHCI reset: device ready after 0ms
> ahcich2: AHCI reset...
> ahcich2: SATA connect timeout time=10000us status=00000000
> ahcich2: AHCI reset: device not found
> ahcich3: AHCI reset...
> ahcich3: SATA connect timeout time=10000us status=00000000
> ahcich3: AHCI reset: device not found
> ahcich4: panic: boundary failed: ctx 0xfffff801a4c2ca00 start 0x131000 
> end 0x133000 boundary 0x1000
> cpuid = 0
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 
> 0xffffffff81c5e6f0
> vpanic() at vpanic+0x189/frame 0xffffffff81c5e770
> kassert_panic() at kassert_panic+0x132/frame 0xffffffff81c5e7e0
> dmar_bus_dmamap_load_something() at 
> dmar_bus_dmamap_load_something+0x35e/frame 0xffffffff81c5e890
> dmar_bus_dmamap_load_buffer() at dmar_bus_dmamap_load_buffer+0x246/frame 
> 0xffffffff81c5e910
> bus_dmamap_load() at bus_dmamap_load+0x8d/frame 0xffffffff81c5e990
> _nvme_qpair_submit_request() at _nvme_qpair_submit_request+0x1ca/frame 
> 0xffffffff81c5e9e0
> nvme_qpair_submit_request() at nvme_qpair_submit_request+0x38/frame 
> 0xffffffff81c5ea10
> nvme_ctrlr_start() at nvme_ctrlr_start+0x7b/frame 0xffffffff81c5ea80
> nvme_ctrlr_start_config_hook() at nvme_ctrlr_start_config_hook+0xe/frame 
> 0xffffffff81c5eaa0
> run_interrupt_driven_config_hooks() at 
> run_interrupt_driven_config_hooks+0x7c/frame 0xffffffff81c5eac0
> boot_run_interrupt_driven_config_hooks() at 
> boot_run_interrupt_driven_config_hooks+0x20/frame 0xffffffff81c5eb50
> mi_startup() at mi_startup+0x118/frame 0xffffffff81c5eb70
> btext() at btext+0x2c
> KDB: enter: panic
> [ thread pid 0 tid 100000 ]
> Stopped at      kdb_enter+0x3e: movq    $0,kdb_why
> db>
> 
> http://dedi3.fuckner.net/~molli123/temp/11-ixgbe.log
> 
> I'll reboot now and check if I patched the file correctly. But this 
> takes 45min.
> 
> At least I figured out how to remove the empty Lines (Ctrl-a, shift-A). 
> Don't run minicom inside a screen ;-)
I do not quite understand why programs such as minicom are needed at all.
Isn't tip (AKA cu) good enough?

Please add the following patch to the kernel.

diff --git a/sys/x86/iommu/intel_gas.c b/sys/x86/iommu/intel_gas.c
index 51ad151..aa59c1b 100644
--- a/sys/x86/iommu/intel_gas.c
+++ b/sys/x86/iommu/intel_gas.c
@@ -327,13 +327,15 @@ dmar_gas_match_one(struct dmar_gas_match_args *a, struct dmar_map_entry *prev,
 	start = roundup2(bs, a->common->alignment);
 	/* DMAR_PAGE_SIZE to create gap after new entry. */
 	if (start + a->size + DMAR_PAGE_SIZE <= prev->end + prev->free_after &&
-	    start + a->size <= end) {
+	    start + a->size <= end && dmar_test_boundary(start, a->size,
+	    a->common->boundary)) {
 		a->entry->start = start;
 		return (true);
 	}
 
 	/*
-	 * Not enough space to align at boundary, but allowed to split.
+	 * Not enough space to align at the requested boundary, or
+	 * boundary is smaller than the size, but allowed to split.
 	 * We already checked that start + size does not overlap end.
 	 *
 	 * XXXKIB. It is possible that bs is exactly at the start of
