From: Eugene Grosbein
To: Gerrit Kühn, freebsd-stable@freebsd.org
Subject: Re: high cpu irq load and slow boot after update from 10.4 to 11.2
Date: Mon, 26 Nov 2018 19:34:43 +0700
Message-ID: <007b9007-6abb-15cf-45df-45b3da814e5d@grosbein.net>
In-Reply-To: <20181126094648.510fc7f7b773bfdac546d037@aei.mpg.de>

26.11.2018 15:46, Gerrit Kühn wrote:

> A couple of weeks ago, I updated an older storage server (2 CPUs, 4 cores
> each, 48GB RAM, 36x4GB HDDs, 3 LSI-based mps controllers) from 10.4 to
> 11.2. The first thing I noticed was that booting takes much longer now. The
> system probes each HDD (there are 36 of them, attached to mps controllers)
> very slowly multiple times (I can see the light of each disk blinking,
> it takes seconds to go on to the next disk), the whole process takes
> several minutes (was much faster before).
>
> A more nasty issue appears after a couple of weeks of operation (so far,
> roughly between 15 and 30 days):
> Suddenly there is a very high irq load on one of the CPU cores
> (cpu:timer), causing high system load and high cpu load (top easily
> shows average load over 10, whereas it was always below 1 before). I cannot
> find any process or device as a culprit. First I thought this problem can
> only be made to go away by rebooting, but now I managed to get rid of it
> (at least for some time, don't know if or when it will be back) while
> checking out the latest source in background (I actually intended to fiddle
> with some kernel settings, but suddenly the issue was gone after
> persisting permanently over the weekend), causing.
>
> Looking around, I found a couple of vaguely similar reports (like
> https://lists.freebsd.org/pipermail/freebsd-current/2017-January/064419.html),
> but these all appear to be fixed by now.
> I have a couple of other storage machines (mostly mps-based, but always
> slightly different hardware) that show no such issue after updating to
> 11.2.
>
> Any ideas?

Maybe this box has some clocking problems that are incompatible with the tickless kernel.
Try going back to the old periodic ticking with sysctl kern.eventtimer.periodic=1
instead of the current default of 0.

Or, if you are curious, run ntpd if it is not already running, wait about an hour,
then look at its /var/db/ntpd.drift file to see whether the system clock is good or not.

Perhaps you can get better behaviour by changing kern.timecounter.hardware
from its default to another value listed in kern.timecounter.choice;
the same goes for kern.eventtimer.timer and kern.eventtimer.choice.
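
To make the drift-file check above a bit more concrete, here is a minimal Python sketch of how one might read the value. It assumes the drift file holds a single number, the clock's frequency offset in parts per million (PPM), which is how ntpd normally writes it. The helper names and the 100 PPM warning threshold are hypothetical choices for illustration, not anything ntpd defines; the 500 PPM figure is the limit beyond which ntpd can no longer discipline the clock.

```python
from pathlib import Path

SECONDS_PER_DAY = 86_400
NTP_MAX_PPM = 500.0   # ntpd cannot correct a frequency error larger than this
WARN_PPM = 100.0      # hypothetical "looks suspicious" threshold, pick your own


def parse_drift_ppm(text: str) -> float:
    """Return the frequency offset in PPM from the contents of a drift file."""
    return float(text.split()[0])


def drift_seconds_per_day(ppm: float) -> float:
    """Convert a frequency offset in PPM to seconds gained or lost per day."""
    return ppm * SECONDS_PER_DAY / 1_000_000


def describe(ppm: float, warn_ppm: float = WARN_PPM) -> str:
    """Give a rough verdict on the clock based on the magnitude of the drift."""
    magnitude = abs(ppm)
    if magnitude >= NTP_MAX_PPM:
        return "bad: beyond what ntpd can correct"
    if magnitude >= warn_ppm:
        return "suspicious: unusually large drift"
    return "ok"


if __name__ == "__main__":
    drift_file = Path("/var/db/ntpd.drift")  # path named in the message above
    if drift_file.exists():
        ppm = parse_drift_ppm(drift_file.read_text())
        per_day = drift_seconds_per_day(ppm)
        print(f"{ppm:+.3f} PPM, about {per_day:+.2f} s/day: {describe(ppm)}")
    else:
        print(f"{drift_file} not found; has ntpd been running long enough?")
```

A small value that stays stable between checks suggests the timecounter is fine; a value pinned near +/-500 PPM, or one that keeps changing a lot, would point toward trying another entry from kern.timecounter.choice.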