From owner-freebsd-hackers@freebsd.org Wed Mar 16 20:29:55 2016 Return-Path: Delivered-To: freebsd-hackers@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id CA5B3AD3391 for ; Wed, 16 Mar 2016 20:29:55 +0000 (UTC) (envelope-from stas@freebsd.org) Received: from mx0.deglitch.com (mx0.deglitch.com [IPv6:2a00:13c0:63:7194:1::3]) by mx1.freebsd.org (Postfix) with ESMTP id 567C186D; Wed, 16 Mar 2016 20:29:55 +0000 (UTC) (envelope-from stas@freebsd.org) Received: from [IPv6:2620:10d:c082:1803:e9ee:be74:4bb7:1ef6] (unknown [IPv6:2620:10d:c090:200::8:4655]) by mx0.deglitch.com (Postfix) with ESMTPSA id 35C308FC0B; Wed, 16 Mar 2016 13:29:50 -0700 (PDT) From: Stanislav Sedov Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Subject: Peformance issues with r278325 Date: Wed, 16 Mar 2016 13:29:47 -0700 Message-Id: To: freebsd-hackers@freebsd.org Mime-Version: 1.0 (Mac OS X Mail 9.2 \(3112\)) X-Mailer: Apple Mail (2.3112) X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 16 Mar 2016 20:29:55 -0000 Hi! We're seeing weird performance issues on some of our systems on both FreeBSD stable/10 and HEAD with r278325. In particular, the part of the change that switched the lapic_ipi_wait to microsecond-based delay seem to cause slowdowns. Some background. Out systems mostly run a single large application that is both I/O heavy and memory allocation heavy. It is doing network = traffic processing, but that does not seem to be the culprit. What we noticed = is that when running on FreeBSD 10.2+ or HEAD the memory allocations (and, = particularly, munmaps) become exponentially slower the bigger the application becomes = in terms of the memory space allocated. This does not seem to happen on FreeBSD = 10.1 or stable/10 with the lapic_ipi_wait() change reverted. The top output when this slowdown happens shows the following: = __________________________________________________________________________= _________ / | 1585 user -21 r31 20225M 17260M vm map 5 483:20 78.47% = [beam.smp{beam.smp}] | 1585 user -21 r31 20225M 17260M *proce 11 483:26 77.29% = [beam.smp{beam.smp}] | 1585 user -21 r31 20225M 17260M vm map 15 484:05 76.86% = [beam.smp{beam.smp}] | 1585 user -21 r31 20225M 17260M CPU21 21 262:49 76.86% = [beam.smp{beam.smp}] | 1585 user -21 r31 20225M 17260M CPU26 26 473:31 76.76% = [beam.smp{beam.smp}] | 1585 user -21 r31 20225M 17260M RUN 5 462:01 76.56% = [beam.smp{beam.smp}] | 1585 user -21 r31 20225M 17260M vm map 2 282:16 76.37% = [beam.smp{beam.smp}] | 1585 user -21 r31 20225M 17260M CPU4 4 475:24 76.27% = [beam.smp{beam.smp}] | 1585 user -21 r31 20225M 17260M CPU38 38 366:16 76.17% = [beam.smp{beam.smp}] | 1585 user -21 r31 20225M 17260M vm map 16 434:35 76.07% = [beam.smp{beam.smp}] = `-------------------------------------------------------------------------= ----------- I had a pmcstat output as well somewhere, but can't find it at the = moment. But it was indicating that a lot of time has been spend in VM mutexes as far as I = remember. I'm not entirely sure why this change would cause any effects like this. = One theory I have is that TLB shootdowns become slower due to lapic_ipi_wait taking = longer than before to complete (as it was using pause instructions instead of = explicit delays). Any ideas? Anything else I can look at to track the reason down? Out = platform is Supermicro X9DRi-LN4+ with dual Xeon(R) CPU E5-2690 v2 CPUs. I'm not = sure if it's platform specific or not though. I've seen that happen on servers with = E5-2690 v3 as well. Thanks! = __________________________________________________________________________= ___________ / | int | lapic_ipi_wait(int delay) | { | - int x; | + int x, incr; | =20 | /* | - * Wait delay microseconds for IPI to be sent. If delay is | + * Wait delay loops for IPI to be sent. This is highly bogus | + * since this is sensitive to CPU clock speed. If delay is | * -1, we wait forever. | */ | if (delay =3D=3D -1) { | - while ((lapic->icr_lo & APIC_DELSTAT_MASK) !=3D = APIC_DELSTAT_IDLE) | - ia32_pause(); | - return (1); | - } | - | - for (x =3D 0; x < delay; x +=3D 5) { | + incr =3D 0; | + delay =3D 1; | + } else | + incr =3D 1; | + for (x =3D 0; x < delay; x +=3D incr) { | if ((lapic->icr_lo & APIC_DELSTAT_MASK) =3D=3D = APIC_DELSTAT_IDLE) | return (1); | - DELAY(5); | + ia32_pause(); | } | return (0); | } | @@ -1433,9 +1433,9 @@ lapic_ipi_raw(register_t icrlo, u_int dest) | intr_restore(saveintr); | } | =20 | -#define BEFORE_SPIN 50000 | +#define BEFORE_SPIN 1000000 | #ifdef DETECT_DEADLOCK | -#define AFTER_SPIN 50 | +#define AFTER_SPIN 1000 | #endif = `-------------------------------------------------------------------------= ---------- -- Stanislav Sedov ST4096-RIPE