Date:      Sat, 13 Aug 2016 13:32:05 -0453.75
From:      "William A. Mahaffey III" <wam@hiwaay.net>
Cc:        freebsd-questions@freebsd.org
Subject:   Re: Monitoring server for crashes
Message-ID:  <398070bd-057f-55bb-2b17-4858f9450c5c@hiwaay.net>
In-Reply-To: <20160813234226.N79687@sola.nimnet.asn.au>
References:  <mailman.115.1471089602.58418.freebsd-questions@freebsd.org> <20160813234226.N79687@sola.nimnet.asn.au>

On 08/13/16 09:33, Ian Smith wrote:
> In freebsd-questions Digest, Vol 636, Issue 7, Message: 10
> On Fri, 12 Aug 2016 11:51:50 -0400 Robert Fitzpatrick <robert@webtent.org> wrote:
>   > Valeri Galtsev wrote:
>   > > Before doing such monitoring I would really do a good hardware test.
>   > > Incidentally, who is the hardware manufacturer (just out of curiosity)?
>   > > The usual suspects are: memory (poor/flaky memory, or a combination of
>   > > modules with slightly different specs; even though they may work
>   > > together, these can fail very rarely, like once every 6 months, which
>   > > is really hard to troubleshoot - just avoid this). Another possibility:
>   > > tripping a temperature threshold set in the BIOS (these, BTW, will
>   > > leave no traces in crash dumps, logs, etc.). Check this and raise the
>   > > threshold some 15-20 F (7-10 C). Incidentally, which CPU(s) do you
>   > > have? (I tend to think AMD will withstand almost any abuse without
>   > > failing - you can practically boil water on them; Intels are not as
>   > > robust.)
>   > >
>   > > What I would do is: open the box, leave minimal hardware (run with a
>   > > minimal amount of RAM, remove all extra cards, etc.) and see if you
>   > > still have the problem with this minimal configuration. If not, start
>   > > adding hardware back: install all the RAM first and test whether it
>   > > crashes. Run memtest86 at this point for at least 48 hours (or at the
>   > > very minimum 2-3 full passes). In this configuration, try to run the
>   > > system under significant CPU load (several multi-threaded "build
>   > > world" runs can help with that) while simultaneously trying to use
>   > > all the RAM - things behave slightly differently under heavy load.
>   > > And so on: add the rest of the hardware and test...
>   > >
>   > > One more thing: check that your PS provides at least 30% more power
>   > > than all the hardware may need. Marginally insufficient power can lead
>   > > to unpredictable behaviour on the PCI bus. Incidentally, how old is
>   > > the power supply (and the rest of the hardware)? Electrolytic
>   > > capacitors may lose capacitance with age and then no longer filter
>   > > ripple well enough: on the PS leads (capacitors inside the PS), on the
>   > > CPU power leads, and on the PCI bus power lines (capacitors on the
>   > > system board - check that they are not showing traces of leakage).
>
> All good advice, Valeri; not sure about messing with temps in the BIOS
> though .. FreeBSD should be handling that OK via ACPI thermal zones
> (checked against their _HOT and _CRT temperatures), and should cleanly
> shut the box down at the _CRT temperature.  That said, if it gets
> anywhere near that hot there's a serious issue ..
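>
> For Valeri's "heavy CPU load plus all the RAM" step, something along
> these lines works as a rough sketch (the 2g memory-disk size, and
> /usr/src being populated for buildworld, are assumptions - size it to
> the box under test):
>
> =======
> #!/bin/sh
> # keep every core busy with a parallel buildworld in the background
> cd /usr/src && make -j "$(sysctl -n hw.ncpu)" buildworld \
>         > /tmp/buildworld.log 2>&1 &
>
> # meanwhile chew through RAM by filling a swap-backed memory disk;
> # dd stops on its own once the 2 GB disk is full
> mkdir -p /mnt/ramtest
> mdmfs -s 2g md /mnt/ramtest
> dd if=/dev/zero of=/mnt/ramtest/fill bs=1m
> =======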
>
>   > Thanks for all the suggestions; I'll check the temperature and other
>   > info in the BIOS tonight. I really can't have the server down for a
>   > long memory test, but I will make sure all the memory is the same. The
>   > server is an IBM x3650 with two quad-core Xeon L5420 CPUs, a mixture
>   > of drives on a hardware ServeRAID 8k controller, and 12GB of RAM. I
>   > purchased it second-hand in 2011. I have a screenshot of the product
>   > data screen in the BIOS; it shows a diagnostics date of Aug 2009, and
>   > all hardware should be original except the drives and memory. The load
>   > comes primarily from a PostgreSQL database; the box also provides DNS
>   > and LDAP services. I'm not sure heat is the issue - it mainly happens
>   > at the same general time at night, and the heaviest load is definitely
>   > during the day.
>
> I guess you've checked with IBM re a BIOS update .. 2009 is a while ago.
>
> Apart from RAM, which rarely just 'goes bad', especially on server-grade
> gear - though "rarely happens" does still happen.
>
> First thing I'd suspect at that age would be the power supply - can you
> swap it with another?  Quickest fix if it works - and it was needed.
>
> Second would be temperature, possibly the fan(s) - which in my
> experience is also the primary cause of blown power supplies.  Below is
> a script I run from cron, every 10 seconds from 02:59 through 03:09, to
> record load averages and temperatures through the daily maintenance run
> that starts at 03:01 - originally for debugging a load average issue,
> not relevant here.  Or you can run it over SSH at home and read the
> last entries over breakfast, whether it crashes or not ..
>
> The lack of any messages - and you should see one if ACPI thermal zone
> detection and forced shutdown are working properly - suggests more of a
> hardware seizure, but with 10-second sampling you could see whether
> temps (and load) were a problem prior to the crash, at least if it
> happens inside the sampling window.
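>
> If the box is overheating and ACPI is doing its job, the kernel should
> have logged something first; worth a quick look (acpi_tz is the thermal
> zone driver's device name, assuming a zone attached at all):
>
> =======
> # any thermal-zone warnings or temperature messages before past crashes?
> grep -i -e acpi_tz -e temperature /var/log/messages
> =======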
>
>   > I see now that most of the time it happens during the nightly dump of
>   > the db, but it has happened once during the day and once a couple of
>   > hours before the backup. I'm leaning toward a memory issue and will
>   > definitely visit the data center tonight to check the module types.
>   > The db size has not changed much over time, and this only started
>   > recently. It is a SpamAssassin/ClamAV db that is purged and vacuumed
>   > every night after dumping. I will disable that and do the dump
>   > manually tonight; 90% of the time it seems to go down during the
>   > backup of the largest db. Perhaps the crashes have caused a table to
>   > become corrupt; I run 'fsck -y' on all mounts in single-user mode
>   > after every crash.
>
> Do the fscks then log success or any problems?  If not, it might be
> worth doing a manual fsck to check.
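>
> A read-only recheck is cheap to script, something like this (assuming
> UFS, with /var only as an example filesystem; expect a little noise
> from soft updates if it is mounted and busy when you run it):
>
> =======
> # force a check but make no changes; a non-zero exit should mean it
> # found something worth a closer look
> fsck -n -f /var || echo "fsck -n flagged problems on /var" | mail -s fsck root
> =======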
>
> /etc/crontab:
> 59      2       *       *       *       root    /root/bin/loadavg_daily
>
> /root/bin/loadavg_daily:
> =======
> #!/bin/sh
> # 19Feb16 loadavg_daily .. every 10 seconds from 02:59 to 03:09 (run by cron)
> log='/root/loadavg_daily.log'
> secs=10
> i=0
> /root/bin/x200stat >> $log	# or something else, or nothing ..
> while [ $i -lt 60 ]; do
>          echo -n "`uptime`  " >> $log
>          echo "`sysctl -n hw.acpi.thermal.tz0.temperature`" \
>          "`sysctl -n hw.acpi.thermal.tz1.temperature`" >> $log
>          sleep $secs
>          i=$((i + 1))
> done
> /root/bin/x200stat >> $log
> echo >> $log
> =======
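>
> To glance at the most recent window over SSH the next morning, the tail
> of the log is enough - each run is roughly 60-odd lines (60 samples
> plus the x200stat lines and the trailing blank):
>
> =======
> tail -n 70 /root/loadavg_daily.log
> =======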
>
> Check sysctl hw.acpi.thermal for your thermal zones of interest.
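>
> For example, to see which zones a box actually exposes, and their trip
> points, before hard-coding tz0/tz1 in the script:
>
> =======
> # dump everything acpi_thermal(4) attached, zone by zone
> sysctl hw.acpi.thermal
>
> # or just the current reading and the _HOT/_CRT trip points for zone 0
> sysctl hw.acpi.thermal.tz0.temperature \
>        hw.acpi.thermal.tz0._HOT \
>        hw.acpi.thermal.tz0._CRT
> =======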
>
> HTH, Ian
>

Out of curiosity, I tried the above command under 9.3R:

[wam@kabini1, ~, 1:30:25pm] 581 % sysctl -n hw.acpi.thermal.tz1.temperature
sysctl: unknown oid 'hw.acpi.thermal.tz1.temperature'
[wam@kabini1, ~, 1:30:46pm] 582 % uname -a
FreeBSD kabini1.local 9.3-RELEASE-p33 FreeBSD 9.3-RELEASE-p33 #0: Wed Jan 13 17:55:39 UTC 2016
     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
[wam@kabini1, ~, 1:31:58pm] 583 %

When did it become available?

-- 

	William A. Mahaffey III

  ----------------------------------------------------------------------

	"The M1 Garand is without doubt the finest implement of war
	 ever devised by man."
                            -- Gen. George S. Patton Jr.



