Date:      Mon, 9 Jan 2012 12:50:54 -0500
From:      John Nielsen <lists@jnielsen.net>
To:        Freddie Cash <fjwcash@gmail.com>
Cc:        FreeBSD Stable <freebsd-stable@freebsd.org>
Subject:   Re: Upgrade from 8.2-STABLE to 9.0-RELEASE wedges on SuperMicro H8DGiF-based system
Message-ID:  <F9A87D68-27E4-4872-A2F2-CD3F0F4D1BE4@jnielsen.net>
In-Reply-To: <CAOjFWZ6PbXCBoOinZRvXKmHDM8xWsYU657yPh5-i9TsmnFpdVg@mail.gmail.com>
References:  <CAOjFWZ6PbXCBoOinZRvXKmHDM8xWsYU657yPh5-i9TsmnFpdVg@mail.gmail.com>

On Jan 9, 2012, at 12:40 PM, Freddie Cash wrote:

> Just wondering if anyone else has run into a similar issue.
>
> We have a ZFS storage server that was running 8.2-STABLE (from around
> beginning of Dec 2011) without any issues, that was upgraded to
> 9.0-RELEASE (to consolidate all the ZFS and networking fixes/updates
> and bring it up to version parity with our other ZFS storage server
> running 9.0) last Thursday.  The "svn switch" of the source tree, the
> buildworld, the buildkernel, the installkernel, the reboot with the
> new kernel, the installworld, the reboot into the new world, the
> mergemaster processes all completed successfully.  About half-way
> through the "make delete-old" process, the box locked up.  No messages
> on the console, no log entries of any kind, everything just stopped.
> Had to do a power-cycle.  And then everything went to hell.  :(
>
> On reboot, the loader complained about not being able to determine
> which disk it was booting from (even though the new loader had already
> booted at least once), and gave strange messages about
> panic/free/something or other (didn't write that error down).
>
> I was able to boot using a 9.0 install CD, drop to a loader prompt,
> unload the kernel/modules from CD, load the kernel/modules from the
> harddrive, set currdev to the harddrive, and boot.  But no matter what
> I did (gpart bootcode using pmbr/gptboot from CD or from HD; copy
> loader from CD, copy /boot from CD), I could not get the loader on the
> HD to load the kernel; always gave the same error message:  can't
> determine which disk we're booting from.
>
> After trying for 24 hours to make it work, I just re-installed off the
> 9.0-RELEASE CD.
>
> Now, this box (alphadrive) will freeze after running for between 3 and
> 10 hours.  Even when left completely idle, it will lock up after about
> 3 hours.  :(
>
> I have another system (betadrive) that's almost identical hardware
> (chassis, backplane, SATA controllers are different, everything else
> is the same) that went from 8.2-STABLE to 9.0-RC2 to 9.0-RC3 to
> 9.0-RELEASE without any issues.  I've tried copying /boot/loader.conf,
> /etc/make.conf, /etc/src.conf, /etc/sysctl.conf, /etc/rc.conf from
> betadrive to alphadrive, without any change in the freezing behaviour.
>
> These are ZFS storage systems, with / (UFS) and swap on SSDs, with 16
> or 24 SATA HDs in the pool (3x 5-disk raidz2 + spare and 4x 6-disk
> raidz2 resp).  All of the ZFS settings are identical between the two
> systems (pool name, pool properties, ZFS filesystems, ZFS properties
> per filesystem).  Dedupe and compression (LZJB) are enabled on both
> systems.
>
> When alphadrive locks up, there are no entries made in any log files;
> there are no log entries on the console; there are no entries in the
> BIOS event log; there are no entries in the IPMI event log; the
> CPU/case temps are below 40C (emergency shutoff is 75C) as shown via
> IPMI; RAM usage is under 20 GB (24 GB per box) with the lowest being
> under 2 GB used (I run top on the console so I can see the stats when
> it locks up, and the time it locks up).  It just ... stops.
>
> The system will even lock up when running in single-user mode, with
> only / mounted (ZFS not loaded, zpool not imported).
>
> Hardware (alphadrive):
>  Chenbro 5U rackmount chassis with 24 hot-swap drive bays
>  SuperMicro H8DGi-F motherboard
>  AMD Opteron 2218 CPU (8-cores at 2.0 GHz)
>  24 GB DDR3-SDRAM
>  3x SuperMicro AOC-USAS-L8i SATA controllers (multi-lane break-out cables)
>  8x Seagate 7200.12 1.5 TB SATA harddrives
> 16x WD RE4 1.0 TB SATA harddrives
>  1x Kingston 60 GB SSD (for /, swap, L2ARC)
>
> Hardware (betadrive):
>  SuperMicro 4U rackmount chassis with 16 hot-swap drive bays
>  SuperMicro H8DGi-F motherboard
>  AMD Opteron 2218 CPU (8-cores at 2.0 GHz)
>  24 GB DDR3-SDRAM
>  2x SuperMicro AOC-USAS2-L8i SATA controllers (multi-lane cables)
> 16x WD RE4 2.0 TB SATA harddrives
>  1x Kingston 60 GB SSD (for /, swap, L2ARC)
>
> betadrive runs perfectly with FreeBSD 9.0-RELEASE.
> alphadrive locks up with FreeBSD 9.0-RELEASE.
>
> We're currently investigating hardware firmware revisions to see if
> anything else is different between the two systems.
>
> Has anyone experienced anything similar?  Does anyone have any ideas on
> what to look for?  Any suggestions on what to try next?

From what you've said I strongly suspect that you have some kind of
hardware issue. Dodgy RAM is my first guess, something cooling-related
is my 2nd, and PSU is my 3rd. It is a little suspicious that you only
started having problems after your upgrade but it could be coincidence
or it could be something about the new software tickling the hardware
differently than the old.

Open it up, make sure you don't have dust buildup and that all the fans
are spinning, re-seat the RAM and then boot into memtest for a few
hours. If you have spare similar hardware you can also try swapping
components until you isolate the fault.

Good luck,

JN