Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 26 Aug 2010 13:03:38 -0700
From:      mdf@FreeBSD.org
To:        freebsd-current@freebsd.org
Cc:        Jeff Roberson <jroberson@jroberson.net>
Subject:   sched_pin() bug in SCHED_ULE
Message-ID:  <AANLkTin6V9pc3d7ifyOmR2V-H5-g_AQFji-LbZzWfAKM@mail.gmail.com>

next in thread | raw e-mail | index | archive | help
Back at the beginning of August I posted about issues with sched_pin()
and witness_warn():

http://lists.freebsd.org/pipermail/freebsd-hackers/2010-August/032553.html

After a lot of debugging I think I've basically found the issue.  I
found this bug on stable/7, but looking at the commit log for CURRENT
I don't see any reason the bug doesn't exist there.  This analysis is
specific to amd64/i386 but the problem is likely to exist on most
arches.

The basic problem is that in both tdq_move() and sched_setcpu() can
manipulate the ts->ts_cpu variable of another process, which may be
running.  This means that a process running on CPU N may be set to run
on CPU M the next context switch.

Then, that process may call sched_pin(), then grab a PCPU variable.
An IPI_PREEMPT can come in, causing the thread to call mi_switch() in
ipi_bitmap_handler().  sched_switch() will then notice that ts->ts_cpu
is not PCPU_GET(cpuid), and call sched_switch_migrate(), migrating the
thread to the intended CPU.  Thus after sched_pin() and potentially
before using any PCPU variable the process can get migrated to another
CPU, and now it is using a PCPU variable for the wrong processor.

Given that ts_cpu can be set by other threads, it doesn't seem worth
checking at sched_pin time whether ts_cpu is not the same as td_oncpu,
because to make the check would require taking the thread_lock.
Instead, I propose adding a check for !THREAD_CAN_MIGRATE() before
calling sched_switch_migrate().  However, sched_pin() is also used by
sched_bind(9) to force the thread to stay on the intended cpu.  I
would handle this by adding a TSF_BINDING state that is different from
TSF_BOUND.

The thing that has me wondering whether this is all correct is that I
don't see any checks in tdq_move() or sched_setcpu() for either
TSF_BOUND or THREAD_CAN_MIGRATE().  Perhaps that logic is managed in
the various calling functions; in that case I would propose adding
asserts that the thread is THREAD_CAN_MIGRATE() unless it's marked
TSF_BINDING.

Does this analysis seem correct?

Thanks,
matthew



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?AANLkTin6V9pc3d7ifyOmR2V-H5-g_AQFji-LbZzWfAKM>