From owner-freebsd-stable@freebsd.org  Mon Jul 24 17:03:21 2017
Return-Path: <owner-freebsd-stable@freebsd.org>
Delivered-To: freebsd-stable@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 9F4DCCFE9D3
 for <freebsd-stable@mailman.ysv.freebsd.org>;
 Mon, 24 Jul 2017 17:03:21 +0000 (UTC)
 (envelope-from eugen@grosbein.net)
Received: from hz.grosbein.net (hz.grosbein.net [78.47.246.247])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "hz.grosbein.net", Issuer "hz.grosbein.net" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id ED96A6F91A;
 Mon, 24 Jul 2017 17:03:20 +0000 (UTC)
 (envelope-from eugen@grosbein.net)
Received: from eg.sd.rdtc.ru (root@eg.sd.rdtc.ru [62.231.161.221])
 by hz.grosbein.net (8.15.2/8.15.2) with ESMTPS id v6OH3AAh050328
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Mon, 24 Jul 2017 19:03:10 +0200 (CEST)
 (envelope-from eugen@grosbein.net)
X-Envelope-From: eugen@grosbein.net
X-Envelope-To: markj@FreeBSD.org
Received: from eg.sd.rdtc.ru (eugen@localhost [127.0.0.1])
 by eg.sd.rdtc.ru (8.15.2/8.15.2) with ESMTP id v6OH35Dg064320;
 Tue, 25 Jul 2017 00:03:05 +0700 (+07)
 (envelope-from eugen@grosbein.net)
Subject: Re: stable/11 debugging kernel unable to produce crashdump again
To: Mark Johnston <markj@FreeBSD.org>
References: <587928B3.2050607@grosbein.net>
 <20170113193726.GC77535@wkstn-mjohnston.west.isilon.com>
 <587A0E12.7070205@grosbein.net> <59746BD5.5010301@grosbein.net>
 <20170724014445.GA20872@raichu>
Cc: FreeBSD Stable <freebsd-stable@FreeBSD.org>,
 Alexander Motin <mav@freebsd.org>
From: Eugene Grosbein <eugen@grosbein.net>
X-Enigmail-Draft-Status: N1110
Message-ID: <59762849.5090208@grosbein.net>
Date: Tue, 25 Jul 2017 00:03:05 +0700
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:38.0) Gecko/20100101
 Thunderbird/38.4.0
MIME-Version: 1.0
In-Reply-To: <20170724014445.GA20872@raichu>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=3.6 required=5.0 tests=BAYES_00, DATE_IN_FUTURE_96_Q, 
 LOCAL_FROM autolearn=no autolearn_force=no version=3.4.1
X-Spam-Report: * 3.3 DATE_IN_FUTURE_96_Q Date: is 4 days to 4 months after
 Received: date
 * -2.3 BAYES_00 BODY: Bayes spam probability is 0 to 1%
 *      [score: 0.0000] *  2.6 LOCAL_FROM From my domains
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on hz.grosbein.net
X-Spam-Level: ***
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-stable>, 
 <mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable/>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
 <mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 24 Jul 2017 17:03:21 -0000

CCing mav@ as graid expert.

On 24.07.2017 08:44, Mark Johnston wrote:

>> Sadly, this time 11.1-STABLE r321371 SMP hangs instead of doing crashdump:
>>
>> - "call doadump" from DDB prompt works just fine;
>> - "shutdown -r now" reboots the system without problems;
>> - "sysctl debug.kdb.panic=1" triggers a panic just fine but system hangs just afer showing uptime
>> instead of continuing with crashdump generation; same if "real" panic occurs.
>>
>> Same for debug.minidump set to 1 or 0. How do I debug this?
> 
> I'm not able to reproduce the problem in bhyve using r321401. Looking
> at the code, the culprits might be cngrab(), or one of the
> shutdown_post_sync eventhandlers. Since you're apparently able to see
> the console output at the time of the panic, I guess it's probably the
> latter. Could you try your test with the patch below applied? It'll
> print a bunch of "entering post_sync"/"leaving post_sync" messages with
> addresses that can be resolved using kgdb. That'll help determine where
> we're getting stuck.
> 
> Index: sys/sys/eventhandler.h
> ===================================================================
> --- sys/sys/eventhandler.h	(revision 321401)
> +++ sys/sys/eventhandler.h	(working copy)
> @@ -85,7 +85,11 @@
>  			_t = (struct eventhandler_entry_ ## name *)_ep;	\
>  			CTR1(KTR_EVH, "eventhandler_invoke: executing %p", \
>   			    (void *)_t->eh_func);			\
> +			if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \
> +				printf("entering post_sync %p\n", (void *)_t->eh_func); \
>  			_t->eh_func(_ep->ee_arg , ## __VA_ARGS__);	\
> +			if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \
> +				printf("leaving post_sync %p\n", (void *)_t->eh_func); \
>  			EHL_LOCK((list));				\
>  		}							\
>  	}								\
> 

Thanks, this helped:

$ addr2line -f -e kernel.debug 0xffffffff80919c00
g_raid_shutdown_post_sync
/home/src/sys/geom/raid/g_raid.c:2458

That is GEOM_RAID's g_raid_shutdown_post_sync() that hangs if called just before
crashdump generation but works just fine during normal system shutdown.

I should note my graid's RAID1 is running in degraded state currently
due to dead SSD module that does not respond. Here is part of boot log:

ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich5: Poll timeout on slot 2 port 0
ahcich5: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd 80 serr 00000000 cmd 0000c217
(aprobe2:ahcich5:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe2:ahcich5:0:0:0): CAM status: Command timeout
(aprobe2:ahcich5:0:0:0): Error 5, Retries exhausted
run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config
ahcich5: Poll timeout on slot 3 port 0
ahcich5: is 00000000 cs 00000008 ss 00000000 rs 00000008 tfd 80 serr 00000000 cmd 0000c317
(aprobe2:ahcich5:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe2:ahcich5:0:0:0): CAM status: Command timeout
(aprobe2:ahcich5:0:0:0): Error 5, Retries exhausted
[skip]
Trying to mount root from ufs:/dev/raid/r0s4a [rw,noatime]...
Root mount waiting for: GRAID-Intel
Root mount waiting for: GRAID-Intel
Root mount waiting for: GRAID-Intel
Root mount waiting for: GRAID-Intel
Root mount waiting for: GRAID-Intel
GEOM_RAID: Intel-c291fe96: Force array start due to timeout.
GEOM_RAID: Intel-c291fe96: Disk ada0 state changed from NONE to ACTIVE.
GEOM_RAID: Intel-c291fe96: Subdisk r0:0-ada0 state changed from NONE to STALE.
GEOM_RAID: Intel-c291fe96: Array started.
GEOM_RAID: Intel-c291fe96: Subdisk r0:0-ada0 state changed from STALE to ACTIVE.
GEOM_RAID: Intel-c291fe96: Volume r0 state changed from STARTING to DEGRADED.
GEOM_RAID: Intel-c291fe96: Provider raid/r0 for volume r0 created.