Date:      Thu, 4 Jul 2013 07:55:48 -0700
From:      Jeremy Chadwick <jdc@koitsu.org>
To:        Travis Mikalson <bofh@terranova.net>
Cc:        freebsd-fs@FreeBSD.org
Subject:   Re: Report: ZFS deadlock in 9-STABLE
Message-ID:  <20130704145548.GA91766@icarus.home.lan>
In-Reply-To: <51D586F9.7060508@terranova.net>
References:  <51D45401.5050801@terranova.net> <51D5776F.5060101@FreeBSD.org> <51D57C19.1080906@terranova.net> <51D5804B.7090702@FreeBSD.org> <51D586F9.7060508@terranova.net>

On Thu, Jul 04, 2013 at 10:30:17AM -0400, Travis Mikalson wrote:
> 
> 
> Andriy Gapon wrote:
> > on 04/07/2013 16:43 Travis Mikalson said the following:
> >> Yes, that helpful article is where I got the run-down on how best to
> >> report what was going on here. I still believe this is an actual
> >> deadlock bug and not a storage layer issue.
> >>
> >> I have not seen any indications of any problems with my storage layer.
> >> You'd think there would be some scary-looking complaint on the console
> >> during one of these deadlocks if it had suddenly lost the capability to
> >> communicate with most or all the disks, but I've deadlocked at least 10
> >> times now in 2013 and never anything of the sort. Thanks to IPMI, I have
> >> actually viewed the console each time it has happened.
> > 
> > Well, I do consider GEOM, CAM, drivers to be parts of the storage layer.
> > In other words, everything below ZFS.
> 
> Ah, I believe I understand. It's not necessarily a hardware issue (which
> is what I took away from the original verbiage), the deadlock may have
> occurred in other parts of the storage layer.
> 
> FWIW, my simple UFS compact flash that I boot from also becomes
> inaccessible during these deadlocks. All UFS and ZFS storage goes dead
> simultaneously. If it were purely a ZFS issue, I suppose one might
> expect to still be able to read from their UFS filesystem.

I'd like to get output from all of these commands:

- dmesg  (you can hide/XXX out the system name if you want, but please
  don't remove anything else, barring IP addresses/etc.)

- zpool get all

- zfs get all

- "gpart show -p" for every disk on the system

- "vmstat -i" when the system is livelocked (if possible; see below)

- The exact brand and model string of mps(4) controllers you're using

- The exact firmware version and firmware type (often a 2-letter code)
  you're using on your mps(4) controllers (dmesg might show some of this
  but possibly not all)

- Is powerd(8) running on this system at all?

Please put these in separate files and upload them to
http://tog.net/freebsd/ if you could.  (For the gpart output, you can
put all the output from all the disks in a single file)
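
A minimal sketch of how the outputs requested above could be gathered into
separate files. OUTDIR, collect() and the file names are hypothetical, not
from this thread. "gpart show -p" is run with no geom argument, which
should list every provider that has a partition table; disks handed to ZFS
whole will not appear. If your zpool(8) insists on a pool name for
"zpool get all", append it. The pgrep line is just one assumed way of
answering the powerd(8) question.

```sh
#!/bin/sh
# Sketch only: OUTDIR and collect() are hypothetical names.
OUTDIR=${OUTDIR:-./zfs-report}
mkdir -p "$OUTDIR"

# Run a command and save its stdout and stderr to one file under OUTDIR.
# A failing or missing command is noted in the file; the script carries on.
collect() {
    out=$OUTDIR/$1
    shift
    "$@" >"$out" 2>&1 || echo "command '$*' exited non-zero" >>"$out"
}

collect dmesg.txt     dmesg
collect zpool-get.txt zpool get all
collect zfs-get.txt   zfs get all
collect gpart.txt     gpart show -p     # all disks in a single file
collect vmstat-i.txt  vmstat -i         # most useful during the livelock
collect powerd.txt    pgrep -l powerd   # prints nothing if powerd is not running
```

The mps(4) brand, model and firmware details still have to be added by
hand where dmesg does not show them.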

I can see your ZFS disks are probably using those mps(4) controllers.  I
also see you have an AHCI controller.

I know you can't move all your disks to the AHCI controller: there
aren't enough ports, and the controller might not even work with SAS
disks (it depends; some newer/higher-end Intel ones do).

"CF drive locking up too" doesn't really tell us anything about the CF
drive, how it's hooked up, etc.  I'd rather not go into that, though,
because there is a simpler test:

Advice:

Hook a SATA disk up to your ahci(4) controller and just leave it there.
No filesystem, just a raw disk sitting on a bus.  When the livelock
happens, in another window issue "dd if=/dev/ada0 of=/dev/null bs=64k"
(disk might not be named ada0; again, need that dmesg) and after a
second or two press Ctrl-T to see if you get any output (output should
be immediate).  If you do get output, it means GEOM and/or CAM are still
functional in some manner, and that puts more focus on the mps(4) side
of things.  There are still nearly infinite explanations for what's
going on though.  Which leads me to...
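
The dd test above can also be scripted, which may help if typing in a
second window is awkward during the lockup. This is a sketch, not a
definitive procedure: probe_disk, SIG and the log path are hypothetical
names. It relies on Ctrl-T delivering SIGINFO on FreeBSD, so sending that
signal with kill should produce the same status line from dd.

```sh
#!/bin/sh
# Sketch only: scripted form of "run dd, wait a moment, press Ctrl-T".
SIG=${SIG:-INFO}    # Ctrl-T delivers SIGINFO on FreeBSD

probe_disk() {
    disk=$1
    log=${2:-/tmp/dd-probe.$$}
    dd if="$disk" of=/dev/null bs=64k 2>"$log" &
    pid=$!
    sleep 2                              # "after a second or two"
    kill -s "$SIG" "$pid" 2>/dev/null    # stands in for Ctrl-T
    sleep 1                              # output should be immediate
    if [ -s "$log" ]; then
        echo "output from dd:"
        cat "$log"
    else
        echo "no output from dd; reads on $disk may be stuck"
    fi
    kill "$pid" 2>/dev/null              # no wait: a stuck dd may never exit
}

# Example (the disk name is an assumption; check dmesg):
# probe_disk /dev/ada0
```

A status line with record counts means the read path under that disk is
still moving. An error message from dd in the output means the disk name
is wrong, not that the test passed.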

Question:

If the system is livelocked, how are you running "procstat -kk -a" in
the first place?  Or does it "livelock" and then release itself from the
pain (eventually), only to re-lock later?  A "livelock" usually implies
the system is still alive in some way, and that only the layer you're
focused on (ZFS I/O) is wonky.  For example, hitting NumLock on the
keyboard (hopefully PS/2) still toggles the LED; the kernel does this,
and I've used it for years as a way to see if a system is locked up or
not.  If the lockup comes and goes, there may be some explanations for
that, but output from the commands above would greatly help.

Question:

What's with the tunings in loader.conf and sysctl.conf for ZFS?  Not
saying those are the issue, just asking why you're setting those at all.
Is there something we need to know about that you've run into in the
past?

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |



