Date:      Thu, 17 Jul 2008 00:14:16 +0200
From:      Roland Smith <rsmith@xs4all.nl>
To:        Jo Rhett <hostmaster@netconsonance.com>
Cc:        FreeBSD Stable <freebsd-stable@freebsd.org>
Subject:   Re: how to get more logging from GEOM?
Message-ID:  <20080716221416.GA39265@slackbox.xs4all.nl>
In-Reply-To: <6AA8BC91-AF84-4CC7-B6BE-4CA84D82EC1E@netconsonance.com>
References:  <C278655C-4FFB-4A8E-9501-2B84283E324D@netconsonance.com> <20080711155831.GA72963@slackbox.xs4all.nl> <6AA8BC91-AF84-4CC7-B6BE-4CA84D82EC1E@netconsonance.com>


On Wed, Jul 16, 2008 at 02:41:28PM -0700, Jo Rhett wrote:
> On Jul 11, 2008, at 8:58 AM, Roland Smith wrote:
> >> After about 2 weeks of watching it carefully I've learned almost
> >> nothing.  It's not a disk failure (AFAIK) it's not cpu overheat (now
> >> running healthd without complaints) it's not based on any given
> >> network traffic...  however it does appear to accompany heavy cpu/disk
> >> activity.  It usually dies when indexing my websites at night (but not
> >> always) and it sometimes dies when compiling programs.   Just heavy
> >> disk isn't enough to do the job, as backups proceed without
> >> problems.   Heavy cpu by itself isn't enough to do it either.  But if
> >> I start compiling things and keep going a while, it will eventually
> >> hang.
> >
> >> Is there anything else I should be looking at?
> >
> > Power supply or motherboard would be my first guess.
>
> If the system went offline, I agree.  But it's clearly a kernel
> deadlock, since the system remains pingable, answers TCP connections,
> etc etc.... but doesn't respond.

Ah. Well, you did say the system 'dies', not 'becomes unresponsive'.

> No TCP negotiation, no response on
> the console, etc.   It's higher level activity which isn't working...

Try compiling a kernel with debugging options, e.g. WITNESS(4), MUTEX_DEBUG,
LOCK_PROFILING, DIAGNOSTIC and INVARIANTS. See /usr/src/sys/conf/NOTES.

This will create a lot of messages in the dmesg output.
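
As a rough sketch, the options could be added to a copy of your existing
kernel configuration along these lines. The name MYDEBUG is made up, and
INVARIANT_SUPPORT, WITNESS_SKIPSPIN, KDB and DDB are additions that are
not in the list above: INVARIANTS depends on INVARIANT_SUPPORT, and
KDB/DDB are only needed if you want to drop into the debugger. Option
names change between releases, so check them against NOTES for your
version before building.

```
# /usr/src/sys/<arch>/conf/MYDEBUG
# (hypothetical name; start from a copy of your current config and add:)
ident		MYDEBUG

options 	INVARIANTS		# extra run-time consistency checks
options 	INVARIANT_SUPPORT	# prerequisite for INVARIANTS
options 	WITNESS			# lock order verification, see witness(4)
options 	WITNESS_SKIPSPIN	# leave spin mutexes out, less overhead
options 	MUTEX_DEBUG		# extra assertions in the mutex code
options 	LOCK_PROFILING		# see LOCK_PROFILING(9)
options 	DIAGNOSTIC		# additional diagnostic checks and messages
options 	KDB			# kernel debugger framework (assumption)
options 	DDB			# interactive in-kernel debugger (assumption)
```

Building and installing it would then be the usual
"make buildkernel KERNCONF=MYDEBUG" followed by
"make installkernel KERNCONF=MYDEBUG" from /usr/src. Expect the machine
to run noticeably slower with WITNESS and INVARIANTS enabled.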

If you can hook the system up to another machine via serial console, you
might be able to debug the kernel. Read the kernel debugging chapter in
the Developers' Handbook.

Another tip is to create a cron job that makes log entries every couple
of minutes with logger. This might help you pinpoint the exact time of
the mishap, to correlate it to other system activity.
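
Something like the following line in root's crontab (crontab -e) would
do. It is only a sketch: the two-minute interval, the "heartbeat" tag and
the choice to log the load average are arbitrary, so adjust to taste.

```
# write a marker with the current load average to syslog every 2 minutes
*/2 * * * * /usr/bin/logger -t heartbeat "alive, loadavg $(/sbin/sysctl -n vm.loadavg)"
```

With the default syslog.conf these entries should end up in
/var/log/messages. After a hang, the timestamp of the last "heartbeat"
line tells you roughly when the machine stopped doing useful work.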

Be _really_ sure that it isn't hardware though. Otherwise you'll be led
on a merry goose chase looking for software errors that aren't there. If
you can restore a backup of this machine's software to a similar one, do
so and see if the hangs persist. If they don't, it's hardware.

Roland
-- 
R.F.Smith                                   http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914  B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)



