From: Raimo Niskanen
To: freebsd-questions@freebsd.org
Date: Thu, 1 Jun 2017 17:14:25 +0200
Subject: Re: Advice on kernel panics
Message-ID: <20170601151425.GF2256@erix.ericsson.se>

On Thu, Jun 01, 2017 at 10:03:27AM -0500, Valeri Galtsev wrote:
> On Thu, June 1, 2017 9:34 am, Ian Smith wrote:
> > In freebsd-questions Digest, Vol 678, Issue 4, Message: 4
> > On Thu, 1 Jun 2017 10:27:49 +0200 Raimo Niskanen wrote:
> > > On Thu, Jun 01, 2017 at 12:10:30AM -0500, Doug McIntyre wrote:
> > > > On Mon, May 29, 2017 at 11:20:43AM +0200, Raimo Niskanen wrote:
> > > > > I have a server that panics about every 3 days and need some
> > > > > advice on how to handle that.
> > > >
> > > > I'd expect it is some sort of hardware failure, as I would expect
> > > > kernel panics more on the order of once a decade with FreeBSD. I.e.
> > > > I've seen one or two on my hundred or so servers, but it's pretty
> > > > rare.
> > > >
> > > > Check and recheck your hardware items.
> > >
> > > I have removed one of four memory capsules - panicked again. Will
> > > rotate through all of them...
> > >
> > > > Run memtest86+. Check your drive hardware, turn on SMART
> > > > checking.
> > >
> > > I have run memtest86+ overnight - no errors found.
> > >
> > > I have installed smartmontools - no errors found, short and long
> > > self tests on both disks run fine. zpool scrub repaired 0 errors
> > > and has no known data errors.
> >
> > Everyone's suggesting hardware problems, and it's certainly worthwhile
> > eliminating that possibility - but this could be a software/OS issue.
>
> I would agree with Ian, it can be software, though it is less likely. I
> have seen a few times that a SCSI-attached external RAID (attached to an
> LSI SCSI HBA) announced a change of its status (like rebuild finished or
> drive timed out/failed), which, simultaneously with other traffic on the
> SCSI bus, confused the adapter and led to a kernel panic.
>
> That said, I would first check the hardware thoroughly. Andrea mentioned
> an aged PS under heavy load, and these are prime suspects. Of all
> components, electrolytic capacitors degrade the most; they may even leak,
> and once degraded they no longer filter ripple sufficiently, which leads
> to ripple beyond tolerable levels at high currents. So:
>
> 1. Open the box and inspect the interior. System board ("motherboard"
> has been its jargon name for over 30 years): inspect the electrolytic
> capacitors around the CPU(s), and those that filter the PCI (or PCI-X,
> or PCI-E) bus power leads. If any of them are bulged, or even show
> traces of leaked electrolyte (usually a brown residue), throw away the
> system board. The model of your box falls into the time span when the
> worst electrolytic capacitors were used.

I did not think this machine was old, but it has apparently been a few
years...

>
> 2. Re-seat all components (including expansion boards and memory; the CPU
> is less likely, but I would do that too), and disconnect and reconnect all
> connectors. Contacts, even gold plated, sometimes do oxidize.

Will try.

>
> 3. Get a new power supply, not necessarily designed for this machine, but
> with the same connectors to the system board, and with a higher power
> rating. Disconnect the box's own PS and power it from the new PS; see if
> it stops failing (PSes do have electrolytic capacitors inside as well;
> other components do not degrade but do not die totally, except for ultra
> high frequency diodes and transistors, and very high voltage diodes).
>
> Good luck!
>
> Valeri

Thank you!

-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB