Date:      Thu, 17 Mar 2016 07:37:04 +0200
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Stanislav Sedov <stas@FreeBSD.org>
Cc:        freebsd-hackers@freebsd.org, jhb@freebsd.org
Subject:   Re: Performance issues with r278325
Message-ID:  <20160317053704.GP1741@kib.kiev.ua>
In-Reply-To: <FA50A68E-7F3D-4361-8A8A-EB7F97EF3D00@FreeBSD.org>
References:  <FA50A68E-7F3D-4361-8A8A-EB7F97EF3D00@FreeBSD.org>

On Wed, Mar 16, 2016 at 01:29:47PM -0700, Stanislav Sedov wrote:
> Hi!
> 
> We're seeing weird performance issues on some of our systems on both
> FreeBSD stable/10 and HEAD with r278325.  In particular, the part of the
> change that switched lapic_ipi_wait() to a microsecond-based delay seems
> to cause slowdowns.
> 
> Some background.  Our systems mostly run a single large application that
> is both I/O heavy and memory-allocation heavy.  It does network traffic
> processing, but that does not seem to be the culprit.  What we noticed is
> that, when running on FreeBSD 10.2+ or HEAD, memory allocations (and
> munmap() calls in particular) become exponentially slower as the
> application's allocated memory space grows.  This does not seem to happen
> on FreeBSD 10.1, or on stable/10 with the lapic_ipi_wait() change reverted.
> 
> The top output when this slowdown happens shows the following:
>    ___________________________________________________________________________________
>   /
>  | 1585 user   -21  r31 20225M 17260M vm map  5 483:20  78.47% [beam.smp{beam.smp}]
>  | 1585 user   -21  r31 20225M 17260M *proce 11 483:26  77.29% [beam.smp{beam.smp}]
>  | 1585 user   -21  r31 20225M 17260M vm map 15 484:05  76.86% [beam.smp{beam.smp}]
>  | 1585 user   -21  r31 20225M 17260M CPU21  21 262:49  76.86% [beam.smp{beam.smp}]
>  | 1585 user   -21  r31 20225M 17260M CPU26  26 473:31  76.76% [beam.smp{beam.smp}]
>  | 1585 user   -21  r31 20225M 17260M RUN     5 462:01  76.56% [beam.smp{beam.smp}]
>  | 1585 user   -21  r31 20225M 17260M vm map  2 282:16  76.37% [beam.smp{beam.smp}]
>  | 1585 user   -21  r31 20225M 17260M CPU4    4 475:24  76.27% [beam.smp{beam.smp}]
>  | 1585 user   -21  r31 20225M 17260M CPU38  38 366:16  76.17% [beam.smp{beam.smp}]
>  | 1585 user   -21  r31 20225M 17260M vm map 16 434:35  76.07% [beam.smp{beam.smp}]
>   `------------------------------------------------------------------------------------
> 
> I had pmcstat output as well somewhere, but I can't find it at the moment.
> As far as I remember, it indicated that a lot of time was being spent in
> VM mutexes.
> 
> I'm not entirely sure why this change would cause effects like this.  One
> theory I have is that TLB shootdowns become slower because lapic_ipi_wait()
> takes longer than before to complete (the old code polled with pause
> instructions instead of explicit delays).
> 
> Any ideas?  Anything else I can look at to track the reason down?  Our
> platform is Supermicro X9DRi-LN4+ with dual Xeon(R) E5-2690 v2 CPUs.  I'm
> not sure whether it's platform specific, though; I've seen this happen on
> servers with E5-2690 v3 CPUs as well.
> 
> Thanks!
> 
>   _____________________________________________________________________________________
>  /
> |  int
> |  lapic_ipi_wait(int delay)
> |  {
> | -       int x;
> | +       int x, incr;
> |  
> |         /*
> | -        * Wait delay microseconds for IPI to be sent.  If delay is
> | +        * Wait delay loops for IPI to be sent.  This is highly bogus
> | +        * since this is sensitive to CPU clock speed.  If delay is
> |          * -1, we wait forever.
> |          */
> |         if (delay == -1) {
> | -               while ((lapic->icr_lo & APIC_DELSTAT_MASK) != APIC_DELSTAT_IDLE)
> | -                       ia32_pause();
> | -               return (1);
> | -       }
> | -
> | -       for (x = 0; x < delay; x += 5) {
> | +               incr = 0;
> | +               delay = 1;
> | +       } else
> | +               incr = 1;
> | +       for (x = 0; x < delay; x += incr) {
> |                 if ((lapic->icr_lo & APIC_DELSTAT_MASK) == APIC_DELSTAT_IDLE)
> |                         return (1);
> | -               DELAY(5);
> | +               ia32_pause();
> |         }
> |         return (0);
> |  }
> | @@ -1433,9 +1433,9 @@ lapic_ipi_raw(register_t icrlo, u_int dest)
> |         intr_restore(saveintr);
> |  }
> |  
> | -#define        BEFORE_SPIN     50000
> | +#define        BEFORE_SPIN     1000000
> |  #ifdef DETECT_DEADLOCK
> | -#define        AFTER_SPIN      50
> | +#define        AFTER_SPIN      1000
> |  #endif
>  `-----------------------------------------------------------------------------------
> 
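
For reference, this is what lapic_ipi_wait() looks like with the revert
applied (reconstructed from the diff above, i.e. the pre-r278325 shape).
It polls the ICR delivery status with ia32_pause() between checks rather
than in 5us DELAY() steps; note that delay then counts loop iterations
again instead of microseconds, which is why BEFORE_SPIN/AFTER_SPIN grow:

int
lapic_ipi_wait(int delay)
{
	int x, incr;

	/*
	 * Wait delay loops for IPI to be sent.  This is highly bogus
	 * since this is sensitive to CPU clock speed.  If delay is
	 * -1, we wait forever.
	 */
	if (delay == -1) {
		incr = 0;	/* x never advances: poll until idle */
		delay = 1;
	} else
		incr = 1;
	for (x = 0; x < delay; x += incr) {
		if ((lapic->icr_lo & APIC_DELSTAT_MASK) == APIC_DELSTAT_IDLE)
			return (1);
		ia32_pause();
	}
	return (0);
}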

IPIs are used not only for TLB shootdowns, but also to notify other CPUs
about new runnable threads.  Basically, the call path is
	[setrunnable()->sched_wakeup()->]sched_add()->tdq_notify()->ipi().
sched_wakeup()/sched_add() are the scheduler primitives called from
miscellaneous unlock paths to notify other cores about threads appearing
on the run queues.  Delivery of the IPI causes rescheduling on the target
CPU.  Look at the turnstile code (for non-sleepable locks) or the
sleepqueue code (for sleepable locks).
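
A condensed sketch of that path (not the verbatim sched_ule.c code, which
also carries CPU-selection and preemption-priority logic omitted here):

void
sched_add(struct thread *td, int flags)
{
	struct tdq *tdq;
	int cpu;

	cpu = sched_pickcpu(td, flags);		/* choose a run queue */
	tdq = TDQ_CPU(cpu);
	tdq_add(tdq, td, flags);		/* enqueue the runnable thread */
	if (cpu != PCPU_GET(cpuid))
		tdq_notify(tdq, td);		/* kick the remote CPU */
}

static void
tdq_notify(struct tdq *tdq, struct thread *td)
{
	/*
	 * ipi_cpu() ends up in lapic_ipi_vectored(), which spins in
	 * lapic_ipi_wait() for the ICR to drain before issuing the
	 * command; any extra latency there is paid on every cross-CPU
	 * wakeup.
	 */
	ipi_cpu(TDQ_ID(tdq), IPI_PREEMPT);
}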

So if the unlock latency becomes higher, it both adds latency to each
unlock and reshuffles the patterns in which runnable threads appear on
the run queues.  And unlock latency is very important, as illustrated by
the significant gains from adaptive locks.
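
Adaptive spinning is also why this shows up so directly: while the lock
owner is on a CPU, waiters spin instead of blocking, so the time from the
owner's unlock to a waiter proceeding is dominated by the unlock path
itself.  Roughly (a simplified sketch of the acquire loop, cf.
__mtx_lock_sleep() in kern_mutex.c):

	for (;;) {
		if (atomic_cmpset_acq_ptr(&m->mtx_lock, MTX_UNOWNED, tid))
			break;				/* got the lock */
		owner = (struct thread *)(m->mtx_lock & ~MTX_FLAGMASK);
		if (owner != NULL && TD_IS_RUNNING(owner)) {
			ia32_pause();			/* owner on CPU: spin */
			continue;
		}
		/* Owner is off CPU: block on the turnstile instead. */
		turnstile_wait(ts, owner, TS_EXCLUSIVE_QUEUE);
	}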

I think that the revert you posted above is a good candidate for commit,
but I defer to John.

Something else you could do is try the HEAD kernel on your machines.
Your CPUs should support x2APIC mode, which HEAD utilizes.  In x2APIC
mode there is no need to wait for the ack from the IPI command at all.
Look at the LAPIC detection lines in a verbose dmesg to see whether
x2APIC is enabled.
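
The difference, roughly: in xAPIC mode the ICR is a pair of memory-mapped
registers and software must poll the Delivery Status bit before reusing
it, while in x2APIC mode the ICR is a single MSR (0x830) with no Delivery
Status bit at all.  A condensed sketch of the two submission paths (cf.
sys/x86/x86/local_apic.c; the ipi_send_sketch() name and the simplified
register handling here are illustrative only):

static void
ipi_send_sketch(u_int dest, uint32_t icrlo)
{
	if (x2apic_mode) {
		/* One WRMSR; there is no delivery-status bit to poll. */
		wrmsr(0x830, ((uint64_t)dest << 32) | icrlo);
		return;
	}
	/* xAPIC: wait for the previous IPI to leave the ICR first. */
	lapic_ipi_wait(BEFORE_SPIN);
	lapic->icr_hi = dest << APIC_ID_SHIFT;
	lapic->icr_lo = icrlo;		/* this MMIO write sends the IPI */
}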


