Date:      Thu, 23 Nov 2017 02:30:33 +0000
From:      bugzilla-noreply@freebsd.org
To:        freebsd-bugs@FreeBSD.org
Subject:   [Bug 223808] zpool attach (and other commands?) fail with misleading error if a wiped disk had previously been used in a RAID array
Message-ID:  <bug-223808-8@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=223808

            Bug ID: 223808
           Summary: zpool attach (and other commands?) fail with
                    misleading error if a wiped disk had previously been
                    used in a RAID array
           Product: Base System
           Version: 11.0-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: freebsd-bugs@FreeBSD.org
          Reporter: stilezy@gmail.com

I'm not sure that this is ZFS-specific; it might have more to do with some
other kernel device handling, or with error message improvement. The issue is
consistently reproducible in ZFS, but I'm not sure what else might cause the
same error, or at what level the error is raised. Probably the kernel?

SITUATION:

Suppose a user has (new+used) HDD "spares". Needing a new HDD in their pool,
they wipe the disk and its MBR/GPT as usual, connect it to their server, and
use a standard command such as 'zpool attach <pool> <existing pool device>
<this HDD>'. The wiping might have been done on Windows using DISKPART ->
CLEAN, or using GPART, or any similar tool.

Suppose also that before being wiped, the disk had been used at some time in
the past with a hardware or "soft" RAID controller. (This issue would probably
happen with LSI controllers and most other firmware/hardware RAID, as they all
store metadata on the HDD, but it clearly does happen with the Intel RST
"soft" or fake RAID.)

We suppose that the disk is correctly sized, has been MBR/GPT/surface wiped,
and whatever else - there's nothing wrong with the disk, system or command.

EXPECTED RESULT:

Most users will expect the command to work and not think twice.

The user expects that the disk will be quickly recognised by the system and
given an identifier ("da5") when connected, and that the commands "zpool attach
tank da1 da5" or "zpool attach tank /dev/da1 /dev/da5" will both work (assuming
da1 is a suitable existing disk for the command and provided that da5 works and
was wiped before use).

ACTUAL RESULT:

The command, in all its permutations, fails with the obscure error "no such
device or dataset". No other debug info is provided to the user, and none of
the usual reasons for this error provide any help in troubleshooting.

It is also left ambiguous to the user which of the three items (1 x pool and
2 x devices) is the one with the problem.

DISCUSSION:

The issue is that when such a disk is plugged in, it's recognised on the basis
of the old RAID metadata, which is not always wiped by usual disk wipe
processes even if they remove the MBR/GPT and perform a surface data wipe of
all user data. There will still be metadata held for the previous RAID
controller after such wiping.

FreeBSD, in identifying the disk, treats it both as a single device (da5) and
as a degraded RAID array (raid/r0p1). Therefore zpool attach fails, but it does
so in a way and with an error message that makes the true cause very obscure.

The problem is that users tend to assume wiping means wiping. But wiping
doesn't always remove RAID metadata, and this can lead to a disk that acts in
strange ways, with expected commands failing obscurely.

A second problem is that zpool attach (and perhaps other commands) does not
make clear *which* device is referred to.

Does "no such device or dataset" in response to "zpool attach tank da0 da1"
mean the pool (and similarly named dataset) "tank" doesn't exist? Or that da0
doesn't exist? Or that da1 doesn't exist? Yet other commands show that they all
do exist. This confuses the user avoidably and would be worth correcting.

ENHANCEMENT:

The error should probably be more specific in all cases: "no such device or
dataset: <name>" so the user knows exactly which item is alleged not to exist.
If it doesn't mean literally that it doesn't exist, then the wording should
convey that it does not exist *or cannot be used*.
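
As a rough illustration of the message shape proposed above (not how zpool
builds its errors), the sketch below names each item that a lookup reports as
absent. The function name_missing_items, the exists callback and the set of
known names are all hypothetical, chosen only for this example.

```python
def name_missing_items(items, exists):
    """Return one message per item that the lookup reports as absent.

    items  -- (kind, name) pairs, e.g. ("pool", "tank")
    exists -- hypothetical callback: name -> bool
    """
    return [
        f"no such device or dataset, or it cannot be used: {kind} '{name}'"
        for kind, name in items
        if not exists(name)
    ]


# Hypothetical state: the pool and da1 resolve, da5 does not.
known = {"tank", "da1"}
messages = name_missing_items(
    [("pool", "tank"), ("device", "da1"), ("device", "da5")],
    lambda name: name in known,
)
print("\n".join(messages))
```

With a message of this form, the user can tell at once which of the three
arguments to "zpool attach" was rejected.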

If a disk is connected that has RAID metadata but is not part of a known array,
the error should probably be more specific: "cannot attach: da5 is part of a
RAID array", and not that it doesn't exist.
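
A minimal sketch of that decision, assuming the caller already knows which
disks have been claimed by a RAID volume. The function attach_error and the
raid_members mapping are hypothetical; the names da5 and raid/r0p1 are the ones
from this report.

```python
def attach_error(device, raid_members):
    """Choose an error message for a device that could not be attached.

    raid_members -- hypothetical mapping of disk name to the RAID volume
                    that has claimed it, e.g. {"da5": "raid/r0p1"}
    """
    volume = raid_members.get(device)
    if volume is not None:
        # The disk exists, but stale RAID metadata has claimed it.
        return f"cannot attach: {device} is part of a RAID array ({volume})"
    # Otherwise fall back to the generic message, naming the item.
    return f"no such device or dataset: {device}"


print(attach_error("da5", {"da5": "raid/r0p1"}))
print(attach_error("da6", {"da5": "raid/r0p1"}))
```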

-- 
You are receiving this mail because:
You are the assignee for the bug.


