Date:      Thu, 12 Oct 2017 22:52:34 +0200
From:      InterNetX - Juergen Gotteswinter <juergen.gotteswinter@internetx.com>
To:        freebsd-fs@freebsd.org
Subject:   Re: ZFS stalled after some mirror disks were lost
Message-ID:  <6d1c80df-7e9f-c891-31ae-74dad3f67985@internetx.com>
In-Reply-To: <DFD0528D-549E-44C9-A093-D4A8837CB499@gmail.com>
References:  <4A0E9EB8-57EA-4E76-9D7E-3E344B2037D2@gmail.com> <DDCFAC80-2D72-4364-85B2-7F4D7D70BCEE@gmail.com> <82632887-E9D4-42D0-AC05-3764ABAC6B86@gmail.com> <20171007150848.7d50cad4@fabiankeil.de> <DFD0528D-549E-44C9-A093-D4A8837CB499@gmail.com>



On 07.10.2017 at 15:57, Ben RUBSON wrote:


>> Indeed. In the face of other types of errors as well, though.
>>
>>>> Essentially, each logical i/o request obtains a configuration lock of
>>>> type 'zio' in shared mode to prevent certain configuration changes
>>>> from happening while there are any outstanding zio-s.
>>>> If a zio is lost, then this lock is leaked.
>>>> Then, the code that deals with vdev failures tries to take this lock in
>>>> exclusive mode while holding a few other configuration locks also in
>>>> exclusive mode, so any other thread needing those locks would block.
>>>> And there are code paths where a configuration lock is taken while
>>>> spa_namespace_lock is held.
>>>> And when spa_namespace_lock is never dropped then the system is close
>>>> to toast, because all pool lookups would get stuck.
>>>> I don't see how this can be fixed in ZFS.  
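
Not ZFS code, just a minimal pthreads sketch of the pattern described
above (all names below are made up): one thread takes a shared hold and
never releases it because its I/O never completes, and the later
exclusive taker, plus everything queued behind it, is stuck forever.

#include <pthread.h>
#include <unistd.h>

/* Stand-in for the per-pool configuration lock; not real ZFS code. */
static pthread_rwlock_t cfg_lock = PTHREAD_RWLOCK_INITIALIZER;

/* A logical I/O holds the lock shared for its whole lifetime. */
static void *
zio_like_thread(void *arg)
{
        (void)arg;
        pthread_rwlock_rdlock(&cfg_lock);
        /*
         * Issue the I/O to the backing device here.  If the completion
         * never arrives (e.g. the request is lost somewhere on the
         * iSCSI path), we never get past this point and the shared
         * hold on cfg_lock is leaked.
         */
        for (;;)
                pause();
        /* pthread_rwlock_unlock(&cfg_lock) is never reached. */
}

/* Failure handling wants the same lock exclusively. */
static void *
failure_handler(void *arg)
{
        (void)arg;
        pthread_rwlock_wrlock(&cfg_lock);       /* blocks forever */
        /*
         * In the real scenario this path would also be holding other
         * locks (think spa_namespace_lock), so everything needing
         * those is now stuck as well.
         */
        pthread_rwlock_unlock(&cfg_lock);
        return (NULL);
}

int
main(void)
{
        pthread_t zio, fail;

        pthread_create(&zio, NULL, zio_like_thread, NULL);
        sleep(1);                       /* let the "zio" take its hold */
        pthread_create(&fail, NULL, failure_handler, NULL);
        pthread_join(fail, NULL);       /* never returns: the stall */
        return (0);
}
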
>>
>> While I haven't used iSCSI for a while now, over the years I've seen
>> lots of similar issues with ZFS pools located on external USB disks

<3

>> and ggate devices (backed by systems with patches for the known data
>> corruption issues).

Ben started a discussion about his setup a few months ago, where he
described what he was going to do. And, at least my prophecy (and I am
pretty sure there were others, too) was that it would end up as a pretty
unreliable setup (gremlins and other things included!) which is far, far
away from being helpful in terms of HA. A single-node setup, with a
reliable hardware configuration and as few moving parts as possible,
would be way more reliable and flawless.

But anyhow, I hate (no, in this case I don't) to say "told you so".

It's like using tons of external USB disks hooked up to flaky
consumer-grade controllers, creating a raidz on top of them, and then
acting surprised when the pool starts going crazy.

Sorry for being an ironic dick, it's frustrating...

> 
> There's no mention of the code revision in this thread.
> It finishes with a message from Alexander Motin:
> "(...) I've got to conclusion that ZFS in many places
> written in a way that simply does not expect errors. In such cases it
> just stucks, waiting for disk to reappear and I/O to complete. (...)"
> 

Yep, nothing new. If the underlying block device works as expected, an
error should be returned to ZFS, but nobody knows what happens in this
setup during a failure. Maybe it's some switch issue, or a network
driver bug, which prevents this and stalls the pool. Who knows what
errors already got masked by the additional iSCSI layer between the
physical disk and ZFS. Lots of fun included for future issues.
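
To illustrate the difference I mean between "the device returns an
error" and "the request silently disappears somewhere in the stack",
here is a tiny made-up completion model (plain pthreads, nothing
FreeBSD- or ZFS-specific):

#include <errno.h>
#include <pthread.h>
#include <stdio.h>

/* A made-up request structure, just for illustration. */
struct req {
        pthread_mutex_t mtx;
        pthread_cond_t  cv;
        int             done;
        int             error;
};

/* The backend reports completion, successful or not. */
static void
req_done(struct req *r, int error)
{
        pthread_mutex_lock(&r->mtx);
        r->error = error;
        r->done = 1;
        pthread_cond_signal(&r->cv);
        pthread_mutex_unlock(&r->mtx);
}

/* The consumer (think: the pool) waits for the verdict, no timeout. */
static int
req_wait(struct req *r)
{
        pthread_mutex_lock(&r->mtx);
        while (!r->done)
                pthread_cond_wait(&r->cv, &r->mtx);
        pthread_mutex_unlock(&r->mtx);
        return (r->error);
}

int
main(void)
{
        struct req r = { PTHREAD_MUTEX_INITIALIZER,
            PTHREAD_COND_INITIALIZER, 0, 0 };

        /*
         * Case 1: the device fails loudly.  req_done(&r, EIO) is
         * called, req_wait() returns EIO and the caller can react
         * (retry, degrade the vdev, whatever).
         *
         * Case 2: some layer in between swallows the request and
         * req_done() is never called.  req_wait() then sleeps forever,
         * which is exactly what a stalled pool looks like.
         */
        req_done(&r, EIO);
        printf("request finished with error %d\n", req_wait(&r));
        return (0);
}

The whole thread is about case 2: nothing ever comes back, so there is
no error for ZFS to act on.
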

For an HA setup, it's funny to add as many components as possible just
to somehow make it look like HA. It's the complete opposite, no matter
what one would call it.

>> I'm not claiming that the patch or other workarounds I'm aware of
>> would actually help with your ZFS stalls at all, but it's not obvious
>> to me that your problems can actually be blamed on the iSCSI code
>> either.
>>
I would guess that it's not a direct issue with ctld, but it's just a
guess...

@Ben

Can you post your iSCSI network configuration, including ctld.conf and
so on? Is your iSCSI setup using multipath, LACP, or is it just
single-pathed?
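
For reference, the interesting bits on the target side are roughly this
shape (ctl.conf(5) syntax; the target name, address and device below are
just placeholders, not your actual values):

portal-group pg0 {
        discovery-auth-group no-authentication
        listen 192.0.2.10
}

target iqn.2017-10.org.example:disk0 {
        auth-group no-authentication
        portal-group pg0
        lun 0 {
                path /dev/da0
        }
}

Plus whatever multipath or LACP configuration sits underneath on both
sides.
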

Overall...

Ben is stacking up way too many layers, which prevents root-cause
diagnostics.

Let's see, I am trying to describe what I see here (please correct me if
this setup is different from my thinking). I think, to debug this mess,
it's absolutely necessary to see all involved components:

- physical machine(s), exporting single raw disks via iSCSI to
"frontends" (please provide the exact configuration, software versions,
and built-in hardware -> especially NIC models, drivers, firmware)

- switch infrastructure (brand, model, firmware version, line speed;
link aggregation in use? if yes, LACP or whatever is in use here?)

- single switch or stacked setup?

- frontend boxes, importing the raw iSCSI disks for a zpool (again,
exact hardware configuration, network configuration, driver/firmware
versions and so on)

Has anyone already checked the switch logs / error counters?

Another thing which came to my mind: has ZFS ever been designed to be
used on top of iSCSI block devices? My impression so far is that ZFS
loves native disks, without any layer in between (no volume manager, no
partitions, no nothing). Most HA setups I have seen so far were using
rock-solid cross-over-cabled SAS JBODs with paths activated on demand in
case of failure. There's not that much that can cause voodoo in such
setups, compared to iSCSI HA/failover scenarios with tons of possibly
problematic components in between.

>> Did you try to reproduce the problem without iSCSI?

I bet the problem won't occur anymore on native disks. Which should NOT
mean that ZFS can't be used on iSCSI devices; I am pretty sure it will
work fine... as long as:

- the iSCSI target behaves well, which includes no strange bugs starting
to party on your SAN network
- rock-solid, stable networking, no hiccups, no retransmits, no loss, no
nothing (don't forget to 100% separate SAN traffic from other traffic;
better go for completely dedicated switches which only handle SAN traffic)
- no NIC driver / firmware issues
- no switch firmware issues

The point is, with this kind of setup you get so many components into
the game that it's nearly impossible to figure out where the voodoo
comes from. It gets even worse with HA setups using such an
infrastructure, which need to be debugged while staying in production.


>> Anyway, good luck with your ZFS-on-iscsi issue(s).

Good luck from me, too. Would be very interesting to know what caused
this issue. Until the next one pops up...

> 
> Thank you very much Fabian for your help and contribution,
> I really hope we'll find the root cause of this issue,
> as it's quite annoying in a HA-expected production environment :/
> 

The point with HA setups is... planning, planning, planning, testing,
testing, praying & hoping that nothing unexpected will happen to your
setup. It's always somehow a gamble, and there will still be enough
situations where even a well-planned HA setup will fail.

But usually it's a pretty good starting point to keep things as simple
as possible when designing such setups. Extra layers like iSCSI in this
case are nothing which should be seen as "keeping things simple"; these
things are a good way to set an HA setup up to fail with whatever
obscure issues. Debugging is a bit...

Overall, please forgive my ironic writing.




> Ben
> 
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
> 


