Date:      Thu, 1 Jun 2017 17:14:25 +0200
From:      Raimo Niskanen <raimo+freebsd@erix.ericsson.se>
To:        <freebsd-questions@freebsd.org>
Subject:   Re: Advice on kernel panics
Message-ID:  <20170601151425.GF2256@erix.ericsson.se>
In-Reply-To: <33501.128.135.52.6.1496329407.squirrel@cosmo.uchicago.edu>
References:  <mailman.103.1496318402.46813.freebsd-questions@freebsd.org> <20170601235447.C98304@sola.nimnet.asn.au> <33501.128.135.52.6.1496329407.squirrel@cosmo.uchicago.edu>

On Thu, Jun 01, 2017 at 10:03:27AM -0500, Valeri Galtsev wrote:
> On Thu, June 1, 2017 9:34 am, Ian Smith wrote:
> > In freebsd-questions Digest, Vol 678, Issue 4, Message: 4
> > On Thu, 1 Jun 2017 10:27:49 +0200 Raimo Niskanen
> > <raimo+freebsd@erix.ericsson.se> wrote:
> >  > On Thu, Jun 01, 2017 at 12:10:30AM -0500, Doug McIntyre wrote:
> >  > > On Mon, May 29, 2017 at 11:20:43AM +0200, Raimo Niskanen wrote:
> >  > > > I have a server that panics about every 3 days and need some
> >  > > > advice on how to handle that.
> >  > >
> >  > > I'd expect it is some sort of hardware failure, as I would expect
> >  > > kernel panics more on the order of once a decade with FreeBSD. I.e.
> >  > > I've seen one or two on my hundred or so servers, but it's pretty
> >  > > rare.
> >  > >
> >  > > Check and recheck your hardware items.
> >  >
> >  > I have removed one of four memory capsules - panicked again.  Will
> >  > rotate through all of them...
> >  >
> >  > >
> >  > > Run memtest86+. Check your drive hardware, turn on SMART
> >  > > checking.
> >  >
> >  > I have run memtest86+ overnight - no errors found.
> >  >
> >  > I have installed smartmontools - no errors found, short and long self
> >  > tests on both disks run fine.  zpool scrub repaired 0 errors and has no
> >  > known data errors.
> >
> > Everyone's suggesting hardware problems, and it's certainly worthwhile
> > eliminating that possibility - but this could be a software/OS issue.
> 
> I would agree with Ian: it can be software, though that is less likely. I
> have seen a few times that a SCSI-attached external RAID (attached to an LSI
> SCSI HBA) announced a change of its status (like rebuild finished or
> drive timed out/failed), which, simultaneously with other traffic on the SCSI
> bus, confused the adapter and led to a kernel panic.
> 
> That said, I would first check the hardware thoroughly. Andrea mentioned an
> aged PS under heavy load, and these are prime suspects. Of all components,
> electrolytic capacitors are the ones that degrade most; they may even leak,
> and then they don't filter ripple sufficiently, leading to ripple beyond
> the tolerable level at high currents. So:
> 
> 1. Open the box and inspect the interior. System board ("motherboard" has
> been its jargon name for over 30 years): inspect the electrolytic capacitors
> around the CPU(s), and those that filter the PCI (or PCI-X, or PCI-E) bus
> power leads. If any of them are bulged, or even have traces of leaked
> electrolyte (usually a brown residue), throw away the system board. The model
> of your box falls into the time span when the worst electrolytic capacitors
> were used.

I did not think this machine was old, but it has apparently been a few
years...

> 
> 2. Re-seat all components (including expansion boards and memory; the CPU is
> less likely, but I would do that too), and disconnect and reconnect all
> connectors. Contacts, even gold-plated ones, sometimes do oxidize.

Will try.

> 
> 3. Get a new power supply, not necessarily designed for this machine, but
> with the same connectors to the system board, and with a higher power
> rating. Disconnect the box's own PS and power it from the new PS; see if it
> stops failing (PSes have electrolytic capacitors inside as well; other
> components do not degrade but do not die totally, except for ultra high
> frequency diodes and transistors, and very high voltage diodes).
> 
> Good luck!
> 
> Valeri

Thank you!


-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20170601151425.GF2256>