Date:      Wed, 19 Jun 2013 19:53:19 +0700
From:      Adam Strohl <adams-freebsd@ateamsystems.com>
To:        freebsd-stable@freebsd.org
Subject:   Re: shutdown -r / shutdown -h / reboot all hang and don't cleanly dismount
Message-ID:  <51C1A9BF.8030304@ateamsystems.com>
In-Reply-To: <20130619122143.GA70813@icarus.home.lan>
References:  <51C1979D.3010305@ateamsystems.com> <20130619122143.GA70813@icarus.home.lan>

On 6/19/2013 19:21, Jeremy Chadwick wrote:
> On Wed, Jun 19, 2013 at 06:35:57PM +0700, Adam Strohl wrote:
>> Hello -STABLE@,
>>
>> So I've seen this situation seemingly randomly on a number of both
>> physical 9.1 boxes as well as VMs for I would say 6-9 months at
>> least.  I finally have a physical box here that reproduces it
>> consistently that I can reboot easily (ie; not a production/client
>> server).
>>
>> No matter what I do:
>>
>> reboot
>> shutdown -p
>> shutdown -r
>>
>> This specific server will stop at "All buffers synced" and not
>> actually power down or reboot.  KB input seems to be ignored.  This
>> server is a ZFS NAS (with GMIRROR for boot blocks) but the other
>> boxes which show this are using GMIRRORs for root/swap/boot (no
>> ZFS).
>>
>> Here is what happens on the console: http://i.imgur.com/1H8JMyB.jpg
>>
>> When I reset the server it appears that disks were not dismounted
>> cleanly ... on this ZFS box it comes back quick because ZFS is good
>> like that but on the other servers with GMIRROR roots rebuilding the
>> GMIRROR and fscking at the same time is murder on the
>> disk/performance until it finishes.
>
> 1. You mention "as well as VMs".  Anything under a "virtual machine" or
> under a hypervisor is going to be very, very, **VERY** different than
> bare metal.  So I hope the issues you're talking about above are on bare
> metal -- I will assume so.

Nope, I sometimes see basically the same thing under the ESXi 5.0
hypervisor (and yes, the implications of something so broad worry me).
I just haven't been able to isolate it on one of those units that isn't
critical.  Let's focus on this server for now, though, per your
suggestion below.

>
> 2. We need to know what version of "9.1" you're using, i.e. 9.1-RELEASE.
> If you use stable/9 (RELENG_9) we need to see uname -a output (you can
> hide the machine name if you want).

Sorry, this ZFS box is 9.1-R P4 (kernel built today):

FreeBSD ilos.dsn 9.1-RELEASE-p4 FreeBSD 9.1-RELEASE-p4 #6: Wed Jun 19 
15:31:12 ICT 2013     root@hostname:/usr/obj/usr/src/sys/ATEAMSYSTEMS  amd64

>
> 3. Can we please have dmesg from this machine?  The controller and some
> other hardware details matter.

Sure, take a look at the full log here: http://pastebin.com/k55gVVuU

This includes a boot, then a reboot as I describe (you can see it log
"All buffers synced", etc.), then powering back on.

>
> 4. Does "sysctl hw.usb.no_shutdown_wait=1" help you?

Weirdly this allowed it to reboot on the first try (without needing to 
be reset), but not the second.  The "Starting background file system 
checks in 60 seconds" message appeared ... that only happens when 
something is dirty, right?

So on the second try, with just this set, I could Ctrl-Alt-Del it and
it responded ... kind of:
http://i.imgur.com/POAIaNg.jpg

Still had to reset it though.

>
> 5. Does "sysctl hw.acpi.handle_reboot=1" help you?

No change.  It still responded to Ctrl-Alt-Del like above, but as
before it still needs to be reset and comes back dirty.

>
> 6. Does "sysctl hw.acpi.disable_on_reboot=1" help you?

No change.  Same as above: it responds to Ctrl-Alt-Del but still needs
a hard reset.
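
For anyone following along, here is a minimal sketch for checking how
the three sysctls suggested above are currently set before a test
reboot.  The helper names show_knob and show_shutdown_knobs are
hypothetical (not from this thread); only the three sysctl names come
from the suggestions above.  On a system where a name does not exist,
the sketch prints "unavailable" instead of a value.

```shell
#!/bin/sh
# Print the current value of each shutdown-related sysctl suggested in
# this thread.  show_knob and show_shutdown_knobs are hypothetical names.
show_knob() {
    # "sysctl -n" prints only the value; fall back if the lookup fails.
    value=$(sysctl -n "$1" 2>/dev/null) || value="unavailable"
    echo "$1=$value"
}

show_shutdown_knobs() {
    for name in hw.usb.no_shutdown_wait \
                hw.acpi.handle_reboot \
                hw.acpi.disable_on_reboot; do
        show_knob "$name"
    done
}

show_shutdown_knobs
```

Recording this output next to each reboot attempt makes it easier to
tell which setting was actually in effect for a given hang.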

>
> 7. If none of the above helps, can you please boot verbose mode and then
> when the system "locks up" on "shutdown -r now" take a picture of the
> VGA console?

Lots of debug output on boot, obviously, but not much different at the
shutdown/hang:
http://i.imgur.com/SgzSsoP.jpg

>
> 8. Does the machine run moused(8) (check the process list please, do not
> rely on rc.conf) ?

ps -auxww | grep moused reveals nothing running (which is how I have 
things set).
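
As a cross-check, a sketch of the same process-list test using
pgrep(1), which matches on the exact process name and so cannot match
the grep command itself.  The is_running helper is a hypothetical name
used only for illustration.

```shell
#!/bin/sh
# Report whether a process with exactly the given name is running.
# is_running is a hypothetical helper, not part of any base utility.
is_running() {
    # pgrep -x requires an exact name match; exit status 0 means found.
    pgrep -x "$1" >/dev/null 2>&1
}

if is_running moused; then
    echo "moused is running"
else
    echo "moused is not running"
fi
```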

>
>> Another interesting thing is that this particular server runs slapd
>> (OpenLDAP) which, when it comes back up, has a "corrupted" DB
>> (easily fixed with db_recover, but still).  This might be because FS
>> commits aren't happening at the end.   I can even manually stop
>> slapd (service slapd stop) then run sync(8) (I assume this does
>> something for ZFS too) and it still comes back as hosed if I reboot
>> shortly after.  If I start/stop slapd it's fine.  So I feel like
>> there is an FS/dismount thing going on here.
>
> sync(8) does not do what you think it does.  Please read (not skim) this
> entire thread starting here:
>
> http://lists.freebsd.org/pipermail/freebsd-fs/2013-April/thread.html#16982
> http://lists.freebsd.org/pipermail/freebsd-fs/2013-April/016982.html

Grokking this now ...

>
> Your problem is related to unclean shutdown; fix that and your issues go
> away.

Yeah, that is my feeling as well.

>
>> Additional information: I also have some boxes which will reboot
>> (ie; they don't freeze like some do at the end) but they don't
>> dismount cleanly either and have to rebuild both GMIRROR and fsck.
>> This might be a different issue, too.
>
> Every issue needs to be handled/treated separately.

Sure, I had just run across some threads about that, but I will focus
on this ZFS box (and see whether whatever fixes it here also helps with
that, once I can reliably reproduce it outside of production).

>


-- 
Adam Strohl
http://www.ateamsystems.com/


