Date:      Sun, 8 May 2016 14:52:22 +0200
From:      Jilles Tjoelker <jilles@stack.nl>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        threads@freebsd.org, arch@freebsd.org
Subject:   Re: Robust mutexes implementation
Message-ID:  <20160508125222.GA48862@stack.nl>
In-Reply-To: <20160507165956.GC89104@kib.kiev.ua>
References:  <20160505131029.GE2422@kib.kiev.ua> <20160506233011.GA99994@stack.nl> <20160507165956.GC89104@kib.kiev.ua>

On Sat, May 07, 2016 at 07:59:56PM +0300, Konstantin Belousov wrote:
> On Sat, May 07, 2016 at 01:30:11AM +0200, Jilles Tjoelker wrote:
> > > 4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
> > >    the lifetime of the shared mutex associated with a vnode's page.
> > >    Apparently, there is real code around which expects the following
> > >    to work:
> > >    - mmap a file, create a shared mutex in the mapping;
> > >    - the process exits;
> > >    - another process starts, mmaps the same file and expects that the
> > >      previously initialized mutex is still usable.

> > >    The knob changes the lifetime of such shared off-page from the
> > >    'destroy on last unmap' to either 'until vnode is reclaimed' or
> > >    until 'pthread_mutex_destroy' called, whatever comes first.

> > The 'until vnode is reclaimed' bit sounds like a recipe for hard to
> > reproduce bugs.

> > I do think it is related to the robust mutex patch, though.

> > Without robust mutexes and assuming threads do not unmap the memory
> > while having a mutex locked or while waiting on a condition variable, it
> > is sufficient to create the off-page mutex/condvar automatically in its
> > initial state when pthread_mutex_lock() or pthread_cond_*wait() are
> > called and no off-page object exists.

> > With robust mutexes, we need to store somewhere whether the next thread
> > should receive [EOWNERDEAD] or not, and this should persist even if no
> > processes have the memory mapped. This might be done by replacing
> > THR_PSHARED_PTR with a different value in the pthread_mutex_t. I'm not
> > sure I like that memory write being done from the kernel though.

> In principle, I agree with this.  Note that if we go with something
> like THR_OWNERDEAD_PTR, the kernel write to set the value would not be
> much different from the kernel write that unlocks a robust mutex with
> inlined lock structures.

> Still, I would prefer to not implement this now.  For the local
> purposes, the knob was enough, and the default value will be 'disabled'.

OK. The patch still initializes umtx_shm_vnobj_persistent to 1 though.
There is also a leak if umtx_shm_vnobj_persistent is toggled from 1 to 0
while an unmapped object with an off-page is active.

I forgot two issues in my analysis.

Firstly, the automatic re-initialization (from r297185, in fact) allows
the next thread to lock the mutex even if the previous thread terminated
while holding the lock. Normally, one would expect that non-robust
mutexes cause the lock attempt to hang (or fail if it's a trylock),
except if a thread ID is reused (in which case the lock attempt may
instead succeed recursively or fail).
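
To make this concrete, here is a minimal sketch of the pattern (error
handling omitted; the file name and exact layout are arbitrary):

	#include <fcntl.h>
	#include <pthread.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* Process A: create a process-shared, non-robust mutex in a
	 * mapped file, lock it and terminate without unlocking. */
	int
	main(void)
	{
		int fd = open("/tmp/shared-lock", O_RDWR | O_CREAT, 0600);
		pthread_mutex_t *m;
		pthread_mutexattr_t attr;

		ftruncate(fd, sizeof(pthread_mutex_t));
		m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);
		pthread_mutexattr_init(&attr);
		pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
		pthread_mutex_init(m, &attr);
		pthread_mutex_lock(m);
		_exit(1);	/* dies with the mutex held */
	}

	/* A second process then maps the same file and calls
	 * pthread_mutex_lock(m).  One would expect the call to hang on
	 * this non-robust mutex, but once the off-page object has been
	 * destroyed on last unmap, the automatic re-initialization lets
	 * the lock silently succeed instead. */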

Secondly, the initialization attributes are lost. If the next thread
after last unmap sees THR_PSHARED_PTR, it has no idea whether it should
be robust or not, which type it should have and whether it should be
PP/PI. A mutex that was originally initialized as non-robust should not
be re-initialized as robust since the code will not handle the
[EOWNERDEAD] and [ENOTRECOVERABLE] errors (in the worst case, errors are
completely ignored and it is as if the mutex did not exist).

Special-casing robust to store in the THR_PSHARED_PTR location seems
strange since the other attributes are still lost.
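
For illustration, the mutex may have been set up with any combination of
attributes like the following (the values here are arbitrary, and the
mutex itself would of course live in the shared mapping), none of which
can be reconstructed from a bare THR_PSHARED_PTR:

	pthread_mutex_t mutex;
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
	pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(&mutex, &attr);
	pthread_mutexattr_destroy(&attr);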

All this is POSIX-compliant since POSIX specifies that the state of
synchronization objects becomes undefined on last unmap, and our
implementation fundamentally depends on that possibility. Unfortunately,
Linux and Solaris do not need it, so applications written for those
systems may rely on synchronization objects persisting. The automatic
re-initialization and umtx_vnode_persistent are just hacks that make
such applications almost always work (but not always, and the failures
will be hard to debug).

Another issue with umtx_vnode_persistent is that it can hide high memory
usage. Filling a page with pthread_mutex_t handles will create many
pages full of actual mutexes, and this memory usage is only visible as
long as the file is still mapped somewhere.
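
As rough arithmetic (assuming amd64, where a pthread_mutex_t handle is a
single 8-byte pointer): one 4096-byte page holds 512 handles, and if
each off-page mutex structure occupies on the order of a hundred bytes,
those handles expand to over a dozen pages of actual mutexes that
persist, invisibly, after the last unmap.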

Apart from that, umtx_vnode_persistent can (at least conceptually) work
fully reliably for shared memory objects and tmpfs files, which do not
have persistent storage.

> > The below is a review of https://kib.kiev.ua/kib/pshared/robust.2.patch

> > > +		if ((error2 == 0 || error2 == EOWNERDEAD) && cancel)
> > >  			_thr_testcancel(curthread);

> > I don't think [EOWNERDEAD] should be swept under the carpet like this.
> > The cancellation cleanup handler will use the protected state and unlock
> > the mutex without making the state consistent and the state will be
> > unrecoverable.

> So your argument there is to return EOWNERDEAD and not cancelling,
> am I right ?  I reused part of your text as the comment.

Yes.

> > > +void
> > > +_mutex_leave_robust(struct pthread *curthread, struct pthread_mutex *m)
> > > +{
> > > +
> > > +	if (!is_robust_mutex(m))
> > > +		return;

> > This accesses the mutex after writing a value to the lock
> > word which allows other threads to lock it. A use after free may result,
> > since it is valid for another thread to lock, unlock and destroy the
> > mutex (assuming the mutex is not used otherwise later).

> > Memory ordering may permit the load of m->m_lock.m_flags to be moved
> > before the actual unlock, so this issue may not actually appear.

> > Given that the overhead of a system call on every robust mutex unlock is
> > not desired, the kernel's unlock of a terminated thread's mutexes will
> > unavoidably have this use after free. However, for non-robust mutexes
> > the previous guarantees should be kept.

> I agree that this is a bug, and agree that the kernel accesses to the
> curthread->inact_mtx are potentially unsafe.  I also did not want to
> issue a syscall for unlocking a robust mutex, as you noted.

> I fixed the bug with the is_robust_mutex() test in _mutex_leave_robust() 
> by caching the robust status.
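
For later readers, the principle of the fix is to read everything needed
from the mutex while it is still owned, so that nothing touches it after
the lock-word write that allows other threads to lock, unlock and
destroy it. A sketch of the principle only, with hypothetical helper
names (the actual patch is structured differently):

	bool robust = is_robust_mutex(m);	/* read while still owned */
	if (robust)
		dequeue_from_robust_list(curthread, m);	/* hypothetical */
	release_lock_word(m);			/* hypothetical; last access */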

> I was indeed worried by the kernel access issue you mentioned, but the
> kernel is immune to 'bad' memory accesses.  What bothered me is the ill
> ABA situation, where the lock memory is freed and repurposed, and then
> the lock word is written with one of the two specific values which give
> the same state as for a locked mutex.  This would cause the kernel to
> 'unlock' it (but not to follow the invalid m_rb_lnk).

> But for this to happen, we must have a situation where a thread
> is being terminated before mutex_unlock_common() reached the
> _mutex_leave_robust() call. This is async thread termination, which
> then must be either process termination (including execve()), or a call
> to thr_exit() from a signal handler in our thread (including async
> cancellation).

> I am sure that the thr_exit() situation is non-conforming, so the only
> concern is process exit, and then only shared robust mutexes, because
> for a private mutex, only the exiting process's state is affected.  I
> can verify in umtx_handle_rb(), for instance, that for a
> USYNC_PROCESS_SHARED object, the underlying memory is backed by the
> umtx shm page.  This would close the race.

> But this would interfere with libthr2, if that ever happens.

Hmm, libthr2 or non-standard synchronization primitive implementations
seem a good reason not to check for the umtx shm page.

However, the existing checks can be made stricter. The umtx_handle_rb()
from robust.3.patch will use m_rb_lnk with no validation at all except
that it is a valid pointer. However, if the UMUTEX_ROBUST flag is not
set, the mutex should not have been in this list at all and it is
probably safer to ignore m_rb_lnk.
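
Something along these lines, as a sketch only (the flag and field names
follow robust.3.patch as I understand it; the surrounding code is
guessed):

	/* In umtx_handle_rb(), after copying the umutex in from the
	 * dying thread's address space: refuse to follow the robust
	 * linkage of a mutex that is not actually marked robust. */
	struct umutex m;
	int error;

	error = copyin((void *)rbp, &m, sizeof(m));
	if (error != 0)
		return (error);
	if ((m.m_flags & UMUTEX_ROBUST) == 0)
		return (EINVAL);	/* corrupted or not ours; stop */
	*rb_list = m.m_rb_lnk;		/* only now trust the linkage */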

> > > @@ -37,7 +37,11 @@ struct umutex {
> > > +	__uintptr_t		m_rb_lnk;	/* Robust linkage */

> > Although Linux also stores the robust list nodes in the mutexes like
> > this, I think it increases the chance of strange memory corruption.
> > Making the robust list an array in the thread's memory would be more
> > reliable. If the maximum number of owned robust mutexes can be small,
> > this can have a fixed size; otherwise, it needs to grow as needed (which
> > does add an allocation that may fail to the pthread_mutex_lock path,
> > bad).

> I gave this proposal some thought.

> I very much dislike the idea of calling a memory allocator on the lock,
> and esp. the trylock, path.  The latter could need to obtain allocator
> locks, which would (sometimes partially) defeat the purpose of trylock.

> I can use mmap(2) directly there, similarly to how pthread_setspecific()
> was changed recently, which would avoid the issue of calling the
> userspace allocator.  Still, the problems of an additional syscall, of
> the resulting ENOMEM, and of the time to copy the current list of owned
> robust mutexes to the grown location remain (I do not see that using
> chunked allocations is reasonable, since it would be the same list as
> the current m_rb_lnk, only at a different level).

> I prefer to keep the robust linked list for these reasons.

There is a difference between chunked allocations and the current
m_rb_lnk in that the list would reside in local memory, not vulnerable
to other processes scribbling over it. This is probably not a major
issue since sharing a mutex already allows threads to block each other
indefinitely.
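
A sketch of what I mean, with entirely hypothetical names and a fixed
cap (growing past the cap would need the mmap(2) fallback you mention):

	#include <errno.h>

	struct umutex;

	#define	ROBUST_LIST_MAX	32	/* arbitrary fixed cap */

	/* Hypothetical per-thread list of owned robust mutexes, kept
	 * in local memory so other processes cannot scribble over it. */
	struct robust_list {
		struct umutex	*rl_owned[ROBUST_LIST_MAX];
		int		 rl_count;
	};

	static int
	robust_list_add(struct robust_list *rl, struct umutex *m)
	{
		if (rl->rl_count >= ROBUST_LIST_MAX)
			return (ENOMEM);	/* would grow with mmap(2) */
		rl->rl_owned[rl->rl_count++] = m;
		return (0);
	}

	static void
	robust_list_remove(struct robust_list *rl, struct umutex *m)
	{
		int i;

		for (i = 0; i < rl->rl_count; i++)
			if (rl->rl_owned[i] == m) {
				rl->rl_owned[i] =
				    rl->rl_owned[--rl->rl_count];
				break;
			}
	}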

The X server has used shared memory across privilege levels for years
but not with anything like mutexes and condition variables. It uses X
protocol messages and xshmfence (custom synchronization object that does
not block the X server).

> In fact, the deciding argument would be actual application usage of the
> robustness.  I thought, when writing the patch, about when and how I
> would use the feature myself, and did not see compelling arguments to
> even try to use it.  My stumbling block is recovery of user data
> consistency: for instance, I tried to write a loop which would
> increment a shared variable N times, and I was not able to come up
> with any simple recovery mechanism for an aborted iteration, except
> writing the iteration in assembly with a parallel tick variable
> enumerating each iteration action.

One way of using this is to never call pthread_mutex_consistent() and to
use [EOWNERDEAD] only to report the problem and to avoid hangs and
crashes. A slightly extended version would tear down the whole structure,
including the shared memory and all involved processes, and rebuild it.

When calling pthread_mutex_consistent(), some data loss will almost
unavoidably occur. In some cases, it may be possible to re-initialize
just the state protected by the mutex. Alternatively, a thread that
modifies the state may make a copy of it before modifying it (some
thread-local StoreStore fences are needed); the [EOWNERDEAD] handling
can then put back the copy if it exists. This way, robust mutexes can be
used to make updates transactional.
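
A sketch of the copy-based transactional update (struct data, the
counter, and the fence placement are hypothetical; a real version needs
to order the backup writes carefully):

	#include <errno.h>
	#include <pthread.h>
	#include <stdatomic.h>

	struct data { int counter; };	/* placeholder for real state */

	struct shared_state {
		struct data	cur;		/* protected by the mutex */
		struct data	backup;		/* copy taken before updates */
		int		backup_valid;
	};

	static int
	update(pthread_mutex_t *m, struct shared_state *st)
	{
		int error;

		error = pthread_mutex_lock(m);
		if (error == EOWNERDEAD) {
			/* Previous owner died mid-update: roll back. */
			if (st->backup_valid)
				st->cur = st->backup;
			st->backup_valid = 0;
			pthread_mutex_consistent(m);
		} else if (error != 0)
			return (error);
		st->backup = st->cur;
		/* StoreStore fence: the copy must be complete before
		 * it is declared valid. */
		atomic_thread_fence(memory_order_release);
		st->backup_valid = 1;
		st->cur.counter++;	/* the actual modification */
		st->backup_valid = 0;	/* commit: rollback no longer needed */
		return (pthread_mutex_unlock(m));
	}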

Approaches based on message passing will be easier to get right and more
reliable, but they might be too slow.

> Updated patch is at https://kib.kiev.ua/kib/pshared/robust.3.patch
> I did not add the check for umtx shm into umtx_handle_rb() yet,
> waiting for your opinion.

-- 
Jilles Tjoelker


