Date:      Sun, 15 Dec 1996 13:32:03 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        toor@dyson.iquest.net (John S. Dyson)
Cc:        toor@dyson.iquest.net, phk@critter.tfs.com, peter@spinner.dialix.com, dyson@freebsd.org, smp@freebsd.org, haertel@ichips.intel.com
Subject:   Re: some questions concerning TLB shootdowns in FreeBSD
Message-ID:  <199612152032.NAA23823@phaeton.artisoft.com>
In-Reply-To: <199612151654.LAA05078@dyson.iquest.net> from "John S. Dyson" at Dec 15, 96 11:54:22 am

> > This won't work because processes seldom have the entire address space
> > shared (vm_refcnt.)  I am sure that when we get true kernel multithreading
> > that will not be true though.  In order to test if a section of an
> > address space is shared, you have to do something like this (and
> > this can take LOTS of time.) (I might have levels of indirection
> > off here, I am also not taking into account submaps -- which
> > complicate the code further, by entailing recursively calling
> > the map/object traversal again -- but recursion is a major
> > no-no in the kernel, as we have found.)
> > 
> 
> Note that I do see that you were talking about shared address spaces,
> but address spaces are already partially shared.  To do the thing completely
> requires traversing a lot of the VM data structures.  I would suggest
> that a coarser grained scheme for pmap_update (invtlb) be considered in
> the case of SMP.  Also (Peter's ?) suggestion that we have individual
> alternate page tables (and temporary mapping pages) for each CPU
> has merit.
> 
> It is likely that large numbers of TLB flushes could be eliminated
> if the above were implemented.  Since global TLB flushes are going to
> be fairly expensive, let's minimize them -- but scanning the VM
> data structures is going to be expensive no matter how we do it.
> 
> Note that I have put individual page invalidates into pmap -- we
> usually need to remove those in the SMP code.  (There are some
> special mapping pages where we should probably continue doing
> the page invalidates -- but those should also be per-cpu.)

Some potential optimizations:

1)	This only applies to written pages not marked copy-on-write;
	read-only pages and pages that will be copied on write (like
	those in your note about "address spaces are already shared")
	do not need shootdowns at all.
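
	A rough sketch of the test, just to pin down what I mean -- the
	attribute bits and the structure are made up for illustration,
	not existing pmap fields:

#include <stdbool.h>
#include <stdint.h>

#define PG_WRITEABLE	0x01	/* mapping allows writes */
#define PG_COW		0x02	/* page will be copied on first write */

struct pte_sketch {
	uint32_t flags;		/* attribute bits above */
	uint32_t cpus_mapped;	/* CPUs that may have cached this mapping */
};

/*
 * A shootdown is only needed for a writable, non-COW page that some
 * other CPU may still hold in its TLB.
 */
static bool
needs_shootdown(const struct pte_sketch *pte, uint32_t my_cpu_bit)
{
	if ((pte->flags & PG_WRITEABLE) == 0)
		return false;		/* read-only: nothing to shoot down */
	if (pte->flags & PG_COW)
		return false;		/* the COW fault will remap it anyway */
	return (pte->cpus_mapped & ~my_cpu_bit) != 0;
}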

2)	Flushing can be "lazy" in most cases.  That is, the page could
	be marked invalid for a particular CPU, and only flushed if
	that CPU needs to use it.  For a generic first-cut implementation,
	a single unsigned long of per-CPU invalidity bits could be added
	to the page attributes (first cut only, because it imposes a 32
	processor limit, which I feel is an unacceptable limitation -- I
	want to run on Connection Machines some day).  It is important
	to realize that this is a negative validity indicator.  That
	dictates who has to do the work: the CPU that wants to access
	the page.  The higher the process's CPU affinity, the less often
	this will happen.
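
	To make this concrete, here is a minimal sketch of the
	negative-validity mask.  The names (tlb_invalid_mask, the
	single-entry invalidate) are invented for illustration, and a
	real version would need atomic updates on the mask:

struct page_attr_sketch {
	unsigned long	tlb_invalid_mask;  /* bit N set: CPU N's TLB entry is stale */
	/* ... the rest of the page attributes ... */
};

/*
 * Writer side: instead of broadcasting an immediate shootdown, mark
 * every other CPU's cached translation as stale.
 */
static void
mark_stale(struct page_attr_sketch *pa, unsigned long online_cpus, int my_cpu)
{
	pa->tlb_invalid_mask |= online_cpus & ~(1UL << my_cpu);
}

/*
 * Reader side: the CPU that wants to use the page does the work.
 * The better the CPU affinity, the less often this path is taken.
 */
static void
lazy_flush_if_stale(struct page_attr_sketch *pa, int my_cpu, void *va)
{
	unsigned long mybit = 1UL << my_cpu;

	if (pa->tlb_invalid_mask & mybit) {
		/* invlpg(va) or equivalent single-entry invalidate here */
		pa->tlb_invalid_mask &= ~mybit;
	}
	(void)va;	/* the real invalidate would consume this */
}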

3)	For processes with shared address space, the common data area
	for all CPUs should be grown.  Yes, I realize this means a
	separate virtual address space for CPU private, CPU shared,
	and per process user space addressing.  This is less of a burden
	than you might think, if you divorce the kernel stack from the
	idea of a process and place it squarely on the head of kernel
	threads.  For a blocking call, a per CPU thread pool can be
	used as a context container.  This would require another bitmap
	on sleep events so that the CPUs affected can be notified
	without blocking everyone.
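
	The sleep-event bitmap could look something like this; the
	structure and the notify hook are hypothetical, just to show
	that a wakeup only has to touch the CPUs that actually have a
	sleeper in their pool:

#define MAXCPU	32	/* matches the unsigned long limit above */

struct sleep_event_sketch {
	void		*chan;		/* the sleep channel (wait address) */
	unsigned long	cpus_waiting;	/* bit N: CPU N's thread pool has a sleeper */
};

/* A thread from this CPU's pool blocks on the channel. */
static void
record_sleep(struct sleep_event_sketch *se, int cpu)
{
	se->cpus_waiting |= 1UL << cpu;
}

/*
 * Wakeup pokes only the CPUs whose pools hold sleepers, rather than
 * broadcasting to (and momentarily blocking) everyone.
 */
static void
wakeup_affected(struct sleep_event_sketch *se)
{
	int cpu;

	for (cpu = 0; cpu < MAXCPU; cpu++) {
		if (se->cpus_waiting & (1UL << cpu)) {
			/* notify_cpu(cpu, se->chan) -- IPI or per-CPU queue */
		}
	}
	se->cpus_waiting = 0;
}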

4)	One obvious consequence of a per CPU thread pool approach
	(which I also think is acceptable) is that the wakeup must be
	processed on the CPU on which the thread went to sleep.  Theoretically,
	this should mean very little, since the CPU returning a kernel
	thread is not necessarily being bound to the process.
	Practically, it probably means that rebinding the CPU for a
	process can only occur on a blocking system call entry (by
	choosing which processor to acquire the kernel thread to handle
	the blocking call) or on involuntary context switch.  This is
	acceptable, since a process making a blocking call or for
	which the CPU has been involuntarily relinquished will not
	feel the calculation overhead of the decision; it will be
	buried in the latency before it is next scheduled to run.
	For an involuntary context switch, this takes the form of
	picking which CPU's run queue to insert the process on.
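
	The decision itself is cheap.  A sketch of the run-queue choice
	on an involuntary context switch -- the load metric and the
	names are invented, not an existing scheduler interface:

#define MAXCPU	32

struct cpu_runq_sketch {
	int	nrunnable;	/* processes currently queued on this CPU */
};

static struct cpu_runq_sketch runqs[MAXCPU];
static int ncpus = 4;		/* assumption: CPUs actually online */

/*
 * Pick the run queue for a preempted process.  Prefer the previous CPU
 * to preserve affinity (and the lazy-invalidation win from (2) above),
 * unless another CPU is clearly less loaded.  The cost of the scan is
 * hidden in the latency before the process runs again.
 */
static int
pick_runq(int prev_cpu)
{
	int cpu, best = prev_cpu;

	for (cpu = 0; cpu < ncpus; cpu++)
		if (runqs[cpu].nrunnable + 1 < runqs[best].nrunnable)
			best = cpu;
	return best;
}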


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


