Date:      Tue, 02 Nov 2021 13:57:33 +0000
From:      bugzilla-noreply@freebsd.org
To:        scsi@FreeBSD.org
Subject:   [Bug 240145] [smartpqi][zfs] kernel panic with hanging vdev
Message-ID:  <bug-240145-5313-1gaZ4b2fod@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-240145-5313@https.bugs.freebsd.org/bugzilla/>
References:  <bug-240145-5313@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=240145

Palle Girgensohn <girgen@FreeBSD.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |girgen@FreeBSD.org

--- Comment #45 from Palle Girgensohn <girgen@FreeBSD.org> ---
Hi!

I also have problems with this controller. With 13.0 installed, it crashed
quite quickly under only moderate I/O load. After upgrading to -STABLE on
October 12, 2021, the system is quite stable, BUT when restoring postgresql
databases with pg_restore -j 5 (five jobs writing in parallel), the database
later reports checksum errors when reading some blocks back.

This seems to happen mainly for big database indexes that were generated in
parallel.

I didn't notice until I took a pg_basebackup, because postgresql does not
validate a block's checksum until the block is read.

Sorry, lots of database terminology, not necessarily common knowledge for SCSI
experts. A pg_basebackup basically copies all the files, quite similar to an
rsync, but optionally also validates, as it reads the data, a per-block
checksum that was calculated for each block when it was written.
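
For readers less familiar with this, the idea of a checksum that is stored at
write time and only checked on read-back can be sketched in a few lines of
Python. This is a minimal illustration, not PostgreSQL's code: zlib.crc32
stands in for PostgreSQL's own page checksum, the 8192-byte block size is
assumed to be the default, and write_blocks, verify_blocks and BLOCK_SIZE are
hypothetical names.

```python
import zlib

BLOCK_SIZE = 8192  # assumed default PostgreSQL page size


def write_blocks(blocks):
    """Store each block together with a checksum computed at write time."""
    return [(bytes(b), zlib.crc32(b)) for b in blocks]


def verify_blocks(stored):
    """Re-read every block; return indexes whose checksum no longer matches."""
    return [i for i, (data, checksum) in enumerate(stored)
            if zlib.crc32(data) != checksum]


# Write three blocks, then simulate silent corruption of block 1
# somewhere below the database (driver, controller or disk).
stored = write_blocks([bytes([n]) * BLOCK_SIZE for n in range(3)])
data, checksum = stored[1]
stored[1] = (b"\xff" + data[1:], checksum)

bad = verify_blocks(stored)
print(bad)  # prints [1]
```

Nothing complains at write time in this sketch; the damaged block only shows
up once something reads every block back, which is what a verifying base
backup does.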

pg_restore reads a database dump, writes all the data to disk, and creates the
indexes using SQL CREATE INDEX commands; that is, it reads back the written
files, calculates each index, and writes it out.

For about 1.3 TB of database data, the system had 2324 blocks with checksum
errors. All but two of them were in indexes, which kind of suggests that this
*could* be a postgresql issue, but given the number of users using postgresql
as opposed to the number of users using this controller with FreeBSD, I'm
reluctant to blame postgresql here. We should have heard of it if there was
a problem with postgresql?

Since most errors were in the indexes, they could be reindexed, and the one
data table that was broken I managed to fix, so at the moment my data seems to
be safe, but I do not trust this controller-driver-OS combo much at the moment.

Anything I can do to help find a solution to the problem? I'm considering
moving the databases back to an old "trusted" box, so if it could help, I could
perhaps supply you with a login to the box in a week or so? Would that help? It
has an ILO for remote console as well.

I am using the built-in RAID:

$ dmesg |grep -i smart
smartpqi0: <P408i-a SR Gen10> port 0x8000-0x80ff mem 0xe6c00000-0xe6c07fff at device 0.0 numa-domain 0 on pci9
smartpqi0: using MSI-X interrupts (32 vectors)
da0 at smartpqi0 bus 0 scbus0 target 0 lun 1
da1 at smartpqi0 bus 0 scbus0 target 0 lun 2
ses0 at smartpqi0 bus 0 scbus0 target 72 lun 0
ses0: <HPE Smart Adapter 3.53> Fixed Enclosure Services SPC-3 SCSI device
pass3 at smartpqi0 bus 0 scbus0 target 1088 lun 1

$ sudo camcontrol devlist
<HPE RAID 1(1+0) OK>               at scbus0 target 0 lun 1 (pass0,da0)
<HPE RAID 1(1+0) OK>               at scbus0 target 0 lun 2 (pass1,da1)
<HPE Smart Adapter 3.53>           at scbus0 target 72 lun 0 (ses0,pass2)
<HPE P408i-a SR Gen10 3.53>        at scbus0 target 1088 lun 1 (pass3)
<Generic- SD/MMC CRW 1.00>         at scbus1 target 0 lun 0 (da2,pass4)

-- 
You are receiving this mail because:
You are the assignee for the bug.