Date:      Mon, 25 Apr 2005 11:50:14 +0100 (BST)
From:      Robert Watson <rwatson@FreeBSD.org>
To:        performance@FreeBSD.org
Subject:   Re: Memory allocation performance/statistics patches
Message-ID:  <20050425114546.O74930@fledge.watson.org>
In-Reply-To: <20050417134448.L85588@fledge.watson.org>
References:  <20050417134448.L85588@fledge.watson.org>


I now have updated versions of these patches, which correct some 
inconsistencies in approach (universal use of curcpu now, for example), 
remove some debugging code, etc.  I've received relatively little 
performance feedback on them, and would appreciate some. :-)  In 
particular, I'd like to know whether they impact disk I/O-related 
workloads, useful macrobenchmarks, etc.  The latest patch is at:

     http://www.watson.org/~robert/freebsd/netperf/20050425-uma-mbuf-malloc-critical.diff

The changes to the following files in the combined patch are intended to 
be broken out into separate patches, as desired, as follows:

kern_malloc.c		malloc.diff
kern_mbuf.c		mbuf.diff
uipc_mbuf.c		mbuf.diff
uipc_syscalls.c		mbuf.diff
malloc.h		malloc.diff
mbuf.h			mbuf.diff
pcpu.h			malloc.diff, mbuf.diff, uma.diff
uma_core.c		uma.diff
uma_int.h		uma.diff

I.e., the pcpu.h changes are a dependency for all of the remaining 
changes.  As before, I'm interested in both the impact of individual 
patches, and the net effect of the total change associated with all 
patches applied.

Because this diff was generated by p4, patch may need some help in 
identifying the targets of each part of the diff.

Robert N M Watson

On Sun, 17 Apr 2005, Robert Watson wrote:

>
> Attached please find three patches:
>
> (1) uma.diff, which modifies the UMA slab allocator to use critical
>    sections instead of mutexes to protect per-CPU caches.
>
> (2) malloc.diff, which modifies the malloc memory allocator to use
>    critical sections and per-CPU data instead of mutexes to store
>    per-malloc-type statistics, coalescing for the purposes of the sysctl
>    used to generate vmstat -m output.
>
> (3) mbuf.diff, which modifies the mbuf allocator to use per-CPU data and
>    critical sections for statistics, instead of synchronization-free
>    statistics which could result in substantial inconsistency on SMP
>    systems.
>
> These changes are facilitated by John Baldwin's recent re-introduction of 
> critical section optimizations that permit critical sections to be 
> implemented "in software", rather than using the hardware interrupt disable 
> mechanism, which is quite expensive on modern processors (especially Xeon P4 
> CPUs).  While not identical, this is similar to the softspl behavior in 4.x, 
> and Linux's preemption disable mechanisms (and various other post-Vax systems 
> :-)).
>
> The reason this is interesting is that it allows synchronization of per-CPU 
> data to be performed at a much lower cost than previously, and consistently 
> across UP and SMP systems.  Prior to these changes, the use of critical 
> sections and per-CPU data as an alternative to mutexes would lead to an 
> improvement on SMP, but not on UP.  So, that said, here's what I'd like us to 
> look at:
>
> - Patches (1) and (2) are intended to improve performance by reducing the
>  overhead of maintaining cache consistency and statistics for UMA and
>  malloc(9), and may universally impact performance (in a small way) due
>  to the breadth of their use through the kernel.
>
> - Patch (3) is intended to restore consistency to statistics in the
>  presence of SMP and preemption, at the possible cost of some
>  performance.
>
> I'd like to confirm that for the first two patches, for interesting 
> workloads, performance generally improves, and that stability doesn't 
> degrade.  For the third patch, I'd like to quantify the cost of the changes 
> for interesting workloads, and likewise confirm no loss of stability.
>
> Because these will have a relatively small impact, a fair amount of caution 
> is required in testing.  We may be talking about a percent or two, maybe 
> four, difference in benchmark performance, and many benchmarks have a higher 
> variance than that.
>
> A couple of observations for those interested:
>
> - The INVARIANTS panic with UMA seen in some earlier patch versions is
>  believed to be corrected.
>
> - Right now, because I use arrays of foo[MAXCPUS], I'm concerned that
>  different CPUs will be writing to the same cache line as they're
>  adjacent in memory.  Moving to per-CPU chunks of memory to hold this
>  stuff is desirable, but I think first we need to identify a model by
>  which to do that cleanly.  I'm not currently enamored of the 'struct
>  pcpu' model, since it makes us very sensitive to ABI changes, as well as
>  not offering a model by which modules can register new per-cpu data
>  cleanly.  I'm also inconsistent about how I dereference into the arrays,
>  and intend to move to using 'curcpu' throughout.
>
> - Because mutexes are no longer used in UMA, and not for the others
>  either, stats read across different CPUs that are coalesced may be
>  slightly inconsistent.  I'm not all that concerned about it, but it's
>  worth thinking on.
>
> - Malloc stats for realloc() are still broken if you apply this patch.
>
> - High watermarks are no longer maintained for malloc since they require a
>  global notion of "high" that is tracked continuously (i.e., at each
>  change), and there's no longer a global view except when the observer
>  kicks in (sysctl).  You can imagine various models to restore some
>  notion of a high watermark, but I'm not currently sure which is the
>  best.  The high watermark notion is desirable though.
>
> So this is a request for:
>
> (1) Stability testing of these patches.  Put them on a machine, make them
>    hurt.  If things go south, try applying the patches one by one until
>    it's clear which is the source.
>
> (2) Performance testing of these patches, subject to the challenges in
>    testing them.  If you are interested, please test each patch
>    separately to evaluate its impact on your system, then apply them
>    all together and see how it evens out.  You may find that the cost
>    of the mbuf allocator patch outweighs the benefits of the other two
>    patches; if so, that is interesting and something to work on!
>
> I've done some micro-benchmarking using tools like netblast, syscall_timing, 
> etc, but I'm interested particularly in the impact on macrobenchmarks.
>
> Thanks!
>
> Robert N M Watson


