Date:      Wed, 19 Oct 2016 08:52:24 +0200
From:      Andrea Venturoli <ml@netfence.it>
To:        "freebsd-stable@freebsd.org" <freebsd-stable@freebsd.org>
Subject:   Nightly disk-related panic since upgrade to 10.3
Message-ID:  <e923a01a-0739-1fc6-32aa-3a1658cd9e7f@netfence.it>

Hello.

Last week I upgraded a 9.3/amd64 box to 10.3: since then, it has crashed
and rebooted at least once every night.

The only exception was on Friday, when it locked up without rebooting: it
still answered ping requests and logins through HTTP would half work. I'm
under the impression that the disk subsystem was hung: ICMP would still
work since it does no I/O, and HTTP worked too as long as no disk access
was required.

Today I was able to get a couple of (almost identical) dumps:

> cpuid = 1
> KDB: stack backtrace:
> #0 0xffffffff804ee170 at kdb_backtrace+0x60
> #1 0xffffffff804b4576 at vpanic+0x126
> #2 0xffffffff804b4443 at panic+0x43
> #3 0xffffffff8068fd2a at softdep_deallocate_dependencies+0x6a
> #4 0xffffffff805394b5 at brelse+0x145
> #5 0xffffffff8053793c at bufwrite+0x3c
> #6 0xffffffff806ae20f at ffs_write+0x3df
> #7 0xffffffff8076d519 at VOP_WRITE_APV+0x149
> #8 0xffffffff806ec7c9 at vnode_pager_generic_putpages+0x2a9
> #9 0xffffffff8076f3b7 at VOP_PUTPAGES_APV+0xa7
> #10 0xffffffff806ea6f5 at vnode_pager_putpages+0xc5
> #11 0xffffffff806e17f8 at vm_pageout_flush+0xc8
> #12 0xffffffff806db432 at vm_object_page_collect_flush+0x182
> #13 0xffffffff806db1cd at vm_object_page_clean+0x13d
> #14 0xffffffff806dadbe at vm_object_terminate+0x8e
> #15 0xffffffff806eac60 at vnode_destroy_vobject+0x90
> #16 0xffffffff806b4232 at ufs_reclaim+0x22
> #17 0xffffffff8076e5c7 at VOP_RECLAIM_APV+0xa7



Has anyone any better insight on what might be going on?
The disks are all connected to a SAS RAID adapter running on mfi; I
don't think it is a hardware issue, since it worked perfectly
for years until I did the upgrade; also, mfiutil says everything is OK
and nothing mfi-related is in the logs.



A few ideas come to mind, on which I could use a second opinion:

_ soft-update is broken: that would really surprise me, since I've been 
using that for years on this and several other boxes (10.3 too);

_ snapshot creation/deletion is causing this: again, I'm using that
almost everywhere, so I don't think this alone might be the cause;
besides, I've been able to do some dumps without trouble and I don't
think anything was messing with snapshots at the time of the last two
panics;

_ mfi driver is broken on 10.3: this is more plausible to me, since
this is the only machine I have it on and it's the only case where I get
these panics.
I found https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=183618, but I 
get no "g_vfs_done()..." messages.

Any other hint?



I'd really like to find out what's going on; I'll appreciate any help
and I'm willing to provide any useful info.

On the other hand, this is a production server, so I have to solve this 
really soon.
Some ideas come to mind, like disabling soft updates (knowing which file
system was having trouble would help here; is there any way to know?),
trying to enable journaling, upgrading to 10-STABLE, or building a kernel
with INVARIANTS/WITNESS/etc., but I'd appreciate a second opinion before I
start shooting in the dark.
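
To at least narrow down the candidates for the soft updates idea, here is a minimal sketch in Python that lists the UFS file systems mounted with soft updates. It does not identify which file system triggered the panic. It assumes the `mount` output has the usual `device on mountpoint (options)` shape; the function name `softdep_filesystems` and the device names in the comment are made up for illustration.

```python
import re

# Assumed shape of one line of `mount` output (device name is hypothetical):
#   /dev/mfid0p2 on / (ufs, local, soft-updates)
MOUNT_LINE = re.compile(r"^(?P<dev>\S+) on (?P<mnt>.+) \((?P<opts>[^)]*)\)$")


def softdep_filesystems(mount_output):
    """Return (device, mountpoint) pairs for UFS file systems using soft updates."""
    found = []
    for line in mount_output.splitlines():
        m = MOUNT_LINE.match(line.strip())
        if not m:
            continue  # skip lines that do not look like mount entries
        opts = [o.strip() for o in m.group("opts").split(",")]
        # Matches both "soft-updates" and "journaled soft-updates".
        if "ufs" in opts and any(o.endswith("soft-updates") for o in opts):
            found.append((m.group("dev"), m.group("mnt")))
    return found
```

The function takes the text printed by `mount` (saved to a file or captured from the command) and returns the file systems that would be affected by turning soft updates off.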



  bye & Thanks
	av.


