Date:      Tue, 21 Jun 2011 15:32:43 -0400
From:      Paul Mather <paul@gromit.dlib.vt.edu>
To:        Nathan Whitehorn <nwhitehorn@freebsd.org>
Cc:        freebsd-ppc@freebsd.org
Subject:   Re: Xserve G5 keeps shutting down
Message-ID:  <E5EE3F19-79AB-417C-A7EE-0F95CE9DB921@gromit.dlib.vt.edu>
In-Reply-To: <4DFFDEEE.40200@freebsd.org>
References:  <38D89FC6-13F1-4AEF-AF41-0A377EE49DC4@gromit.dlib.vt.edu> <4DFFDEEE.40200@freebsd.org>

On Jun 20, 2011, at 7:59 PM, Nathan Whitehorn wrote:

> On 06/20/11 15:22, Paul Mather wrote:
>> I'm running FreeBSD/powerpc64 -CURRENT on an Xserve G5.  With a
>> recent kernel, the system will not stay up for more than a few hours
>> at a time. :-(
>>
>> I have no idea why the machine is shutting off.  There is no panic or
>> crash dump and there is no indication in the logs of anything awry.
>> The system just powers down.  The times this has happened when I have
>> been there have not indicated anything stressing the system (like all
>> fans racing madly) and oftentimes the system has been relatively idle.
>> (Oddly, to my knowledge it has never shut down when doing something
>> potentially taxing, such as a make -j5 buildworld or the like.)
>>
>> The main thing I have noticed since building this new kernel is that
>> the fans are now controlled automatically, i.e., there is now no need
>> for the tickle-the-fan-controller cron job of yore, meaning the fans
>> won't race when in single user mode (e.g., during an installworld).
>
> If the temperature on any sensor exceeds its maximum value, it will
> cause the machine to shut off. There was at one point a problem with
> some of the sensor drivers that would sometimes report erroneous crazy
> values. Most of the known problems were fixed by andreast a few weeks
> ago, but it looks like you ran into another. My work desktop has a
> ds1775 and a max6690, and has no problems, but not an ad7417, so I
> would guess the problem lies there. Could you try commenting out line
> 116 of /sys/powerpc/powermac/powermac_thermal.c? That will cause it to
> spam the console (and dmesg) about the error, identifying the sensor,
> but not shut off the machine, and so both keep your server on and let
> us work out the problem.


I built a new kernel with the shutdown line identified above commented
out.  The resultant system stayed up for several hours doing various -j5
buildworld/buildkernels but just now shut down. :-(  Unfortunately,
nothing appeared on the console, so there is no logged reason for the
shutdown.
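
Since nothing reaches the console before power-off, one way to keep a
record is to poll the sensor sysctls and force each snapshot to disk.
The following is only a minimal sketch under stated assumptions: the
function names and the log path are hypothetical, the filter reuses the
same pattern as the egrep command in this message, and it has not been
run on the machine in question.

```python
import os
import re
import time

# Same pattern as the egrep filter applied to `sysctl -a` in this message.
SENSOR_RE = re.compile(r"dev.*temp|fans")


def filter_sensor_lines(sysctl_text):
    """Keep only the temperature and fan lines from `sysctl -a` output."""
    return [line for line in sysctl_text.splitlines() if SENSOR_RE.search(line)]


def append_snapshot(log_path, sysctl_text, now=None):
    """Append one timestamped snapshot to log_path and force it to disk.

    `now` is seconds since the epoch; None means the current time.
    Returns the number of lines written.
    """
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    lines = filter_sensor_lines(sysctl_text)
    with open(log_path, "a") as log:
        for line in lines:
            log.write(f"{stamp} {line}\n")
        log.flush()
        # fsync so the last snapshot has a chance of surviving an abrupt power-off.
        os.fsync(log.fileno())
    return len(lines)
```

In use, the text would come from something like
subprocess.run(["sysctl", "-a"], capture_output=True, text=True).stdout,
called in a loop every few seconds with a log path of your choosing.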

I started up the system again, but it shut down again after a few
minutes of uptime.  When I started it up for the third (and last) time,
I managed to grab this output from the temp/fan sysctls before it shut
down (a minute or two after booting up):

paul@backup:/home/paul> sysctl -a | egrep 'dev.*temp|fans'
machdep.manage_fans: 1
dev.max6690.0.%pnpinfo: name=temp-monitor compat=max6690
dev.max6690.0.sensor.sys_ctrlr_ambient.temp: 41.5C
dev.max6690.0.sensor.sys_ctrlr_internal.temp: 50.1C
dev.fcu.0.fans.cpu_a_1.minrpm: 1200
dev.fcu.0.fans.cpu_a_1.maxrpm: 14000
dev.fcu.0.fans.cpu_a_1.rpm: 1984
dev.fcu.0.fans.cpu_a_2.minrpm: 1200
dev.fcu.0.fans.cpu_a_2.maxrpm: 14000
dev.fcu.0.fans.cpu_a_2.rpm: 1984
dev.fcu.0.fans.cpu_a_3.minrpm: 1200
dev.fcu.0.fans.cpu_a_3.maxrpm: 14000
dev.fcu.0.fans.cpu_a_3.rpm: 1984
dev.fcu.0.fans.cpu_b_1.minrpm: 1200
dev.fcu.0.fans.cpu_b_1.maxrpm: 14000
dev.fcu.0.fans.cpu_b_1.rpm: 1984
dev.fcu.0.fans.cpu_b_2.minrpm: 1200
dev.fcu.0.fans.cpu_b_2.maxrpm: 14000
dev.fcu.0.fans.cpu_b_2.rpm: 1984
dev.fcu.0.fans.cpu_b_3.minrpm: 1200
dev.fcu.0.fans.cpu_b_3.maxrpm: 14000
dev.fcu.0.fans.cpu_b_3.rpm: 1984
dev.fcu.0.fans.sys_ctrlr_fan.minpwm: 40
dev.fcu.0.fans.sys_ctrlr_fan.maxpwm: 100
dev.fcu.0.fans.sys_ctrlr_fan.pwm: 54
dev.fcu.0.fans.sys_ctrlr_fan.rpm: 11264
dev.fcu.0.fans.pci_fan.minpwm: 40
dev.fcu.0.fans.pci_fan.maxpwm: 100
dev.fcu.0.fans.pci_fan.pwm: 48
dev.fcu.0.fans.pci_fan.rpm: 9792
dev.ad7417.0.sensor.cpu_a_ad7417_amb.temp: 36.7C
dev.ad7417.0.sensor.cpu_a_diode_temp.temp: 53.8C
dev.ad7417.1.sensor.cpu_b_ad7417_amb.temp: 32.0C
dev.ad7417.1.sensor.cpu_b_diode_temp.temp: 52.6C
dev.ds1775.0.%pnpinfo: name=temp-monitor compat=lm75


The cpu_{a,b}_diode_temp temperatures were higher during the buildworld
(63--67C) and it stayed up at that time.
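
To compare readings like these across snapshots, the "...temp: NN.NC"
lines can be parsed into numbers and checked for values that look like
sensor glitches.  This is a rough sketch only: the function names are
hypothetical, and the 0-100C plausibility window is an arbitrary
illustration, not the limits the kernel's thermal code uses.

```python
import re

# Matches lines such as "dev.ad7417.0.sensor.cpu_a_diode_temp.temp: 53.8C".
TEMP_RE = re.compile(r"^(dev\.\S+\.temp):\s*(-?\d+(?:\.\d+)?)C\s*$")


def parse_temps(sysctl_text):
    """Return {sysctl name: degrees Celsius} for every temperature line."""
    temps = {}
    for line in sysctl_text.splitlines():
        match = TEMP_RE.match(line.strip())
        if match:
            temps[match.group(1)] = float(match.group(2))
    return temps


def implausible(temps, low=0.0, high=100.0):
    """Return the readings outside [low, high].

    The default bounds are illustrative assumptions, not driver limits.
    """
    return {name: value for name, value in temps.items()
            if not (low <= value <= high)}
```

Run over the output above, parse_temps() would pick up the six
temperature sysctls and skip the %pnpinfo and fan lines; implausible()
would only return something if a reading fell outside the chosen window.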

I'm flummoxed at this point as to what is responsible for the shutdowns.
Are there any other hardware monitoring-related shutdowns in the kernel
code?  The funny thing about the ad7417 device is that I only recently
added it to my kernel config file as I noticed it had appeared in
GENERIC.

Tomorrow I'll build a GENERIC kernel with the shutdown line commented
out, and see if I have any better luck with that.

Cheers,

Paul.