From owner-freebsd-hackers@freebsd.org Thu Mar 17 05:37:09 2016
Date: Thu, 17 Mar 2016 07:37:04 +0200
From: Konstantin Belousov
To: Stanislav Sedov
Cc: freebsd-hackers@freebsd.org, jhb@freebsd.org
Subject: Re: Performance issues with r278325
Message-ID: <20160317053704.GP1741@kib.kiev.ua>
User-Agent: Mutt/1.5.24 (2015-08-30)
List-Id: Technical Discussions relating to FreeBSD
On Wed, Mar 16, 2016 at 01:29:47PM -0700, Stanislav Sedov wrote:
> Hi!
>
> We're seeing weird performance issues on some of our systems on both
> FreeBSD stable/10 and HEAD with r278325.  In particular, the part of
> the change that switched lapic_ipi_wait() to a microsecond-based delay
> seems to cause slowdowns.
>
> Some background: our systems mostly run a single large application
> that is both I/O heavy and memory-allocation heavy.  It does network
> traffic processing, but that does not seem to be the culprit.  What we
> noticed is that, when running on FreeBSD 10.2+ or HEAD, memory
> allocations (and munmaps in particular) become dramatically slower the
> larger the application's allocated memory space grows.  This does not
> happen on FreeBSD 10.1, or on stable/10 with the lapic_ipi_wait()
> change reverted.
>
> The top output when this slowdown happens shows the following:
> _____________________________________________________________________________________
> /
> | 1585 user -21 r31 20225M 17260M vm map   5 483:20 78.47% [beam.smp{beam.smp}]
> | 1585 user -21 r31 20225M 17260M *proce  11 483:26 77.29% [beam.smp{beam.smp}]
> | 1585 user -21 r31 20225M 17260M vm map  15 484:05 76.86% [beam.smp{beam.smp}]
> | 1585 user -21 r31 20225M 17260M CPU21   21 262:49 76.86% [beam.smp{beam.smp}]
> | 1585 user -21 r31 20225M 17260M CPU26   26 473:31 76.76% [beam.smp{beam.smp}]
> | 1585 user -21 r31 20225M 17260M RUN      5 462:01 76.56% [beam.smp{beam.smp}]
> | 1585 user -21 r31 20225M 17260M vm map   2 282:16 76.37% [beam.smp{beam.smp}]
> | 1585 user -21 r31 20225M 17260M CPU4     4 475:24 76.27% [beam.smp{beam.smp}]
> | 1585 user -21 r31 20225M 17260M CPU38   38 366:16 76.17% [beam.smp{beam.smp}]
> | 1585 user -21 r31 20225M 17260M vm map  16 434:35 76.07% [beam.smp{beam.smp}]
> `------------------------------------------------------------------------------------
>
> I had pmcstat output as well somewhere, but can't find it at the
> moment.  As far as I remember, it indicated that a lot of time had
> been spent in VM mutexes.
>
> I'm not entirely sure why this change would cause effects like this.
> One theory I have is that TLB shootdowns become slower because
> lapic_ipi_wait() now takes longer than before to complete (the old
> code used pause instructions instead of explicit delays).
>
> Any ideas?  Anything else I can look at to track down the cause?  Our
> platform is a Supermicro X9DRi-LN4+ with dual Xeon(R) E5-2690 v2 CPUs.
> I'm not sure whether the issue is platform-specific, though; I have
> seen it happen on servers with the E5-2690 v3 as well.
>
> Thanks!
>
> _____________________________________________________________________________________
> /
> | int
> | lapic_ipi_wait(int delay)
> | {
> | -	int x;
> | +	int x, incr;
> |
> | 	/*
> | -	 * Wait delay microseconds for IPI to be sent.  If delay is
> | +	 * Wait delay loops for IPI to be sent.  This is highly bogus
> | +	 * since this is sensitive to CPU clock speed.  If delay is
> | 	 * -1, we wait forever.
> | 	 */
> | 	if (delay == -1) {
> | -		while ((lapic->icr_lo & APIC_DELSTAT_MASK) != APIC_DELSTAT_IDLE)
> | -			ia32_pause();
> | -		return (1);
> | -	}
> | -
> | -	for (x = 0; x < delay; x += 5) {
> | +		incr = 0;
> | +		delay = 1;
> | +	} else
> | +		incr = 1;
> | +	for (x = 0; x < delay; x += incr) {
> | 		if ((lapic->icr_lo & APIC_DELSTAT_MASK) == APIC_DELSTAT_IDLE)
> | 			return (1);
> | -		DELAY(5);
> | +		ia32_pause();
> | 	}
> | 	return (0);
> | }
> | @@ -1433,9 +1433,9 @@ lapic_ipi_raw(register_t icrlo, u_int dest)
> | 	intr_restore(saveintr);
> | }
> |
> | -#define BEFORE_SPIN	50000
> | +#define BEFORE_SPIN	1000000
> | #ifdef DETECT_DEADLOCK
> | -#define AFTER_SPIN	50
> | +#define AFTER_SPIN	1000
> | #endif
> `-----------------------------------------------------------------------------------

IPIs are used not only for TLB shootdowns, but also to notify other CPUs
about new runnable threads.  Basically, the call path is
[setrunnable() -> sched_wakeup() ->] sched_add() -> tdq_notify() -> ipi().
sched_wakeup()/sched_add() are the scheduler primitives called from
miscellaneous unlock paths to notify other cores about threads appearing
on the run queues; delivery of the IPI causes rescheduling.  Look at the
turnstile code (for non-sleepable locks) or the sleepqueue code (for
sleepable locks).  So if the latency of an unlock becomes higher, it both
adds latency to the unlock itself and reshuffles the patterns in which
runnable threads appear on the run queues.  And unlock latency is very
important, as illustrated by the significant gains from adaptive locks.

I think the revert you posted above is a good candidate for commit, but
I defer to John.

What else you could do is try the HEAD kernel on your machines.  Your
CPUs should support x2APIC mode, which HEAD utilizes.  In x2APIC mode
there is no need to wait for an ack of the IPI command at all.  Look at
the verbose dmesg LAPIC detection lines to see whether x2APIC is enabled.