Date:      Tue, 22 Oct 2019 14:48:56 +0300
From:      Andriy Gapon <avg@FreeBSD.org>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        FreeBSD Current <freebsd-current@FreeBSD.org>
Subject:   Re: thread on sleepqueue does not wake up after timeout
Message-ID:  <3a67f9a9-31cf-5814-4a68-8bdd6063b21e@FreeBSD.org>
In-Reply-To: <20191022104434.GM73312@kib.kiev.ua>
References:  <aff7b1e5-c380-9d86-d638-047e618894e6@FreeBSD.org> <20191022104434.GM73312@kib.kiev.ua>

On 22/10/2019 13:44, Konstantin Belousov wrote:
> On Tue, Oct 22, 2019 at 01:08:59PM +0300, Andriy Gapon wrote:
>>
>> We observe a problem that happens very rarely (about once a month across many
>> test machines).  The problem is that a thread remains in sleepq_timedwait() even
>> after its timeout expires.  The thread's td_slpcallout looks like the callout
>> has fired.  But the thread's state looks like it was never notified.
>> E.g.:
>> (kgdb) p td->td_slpcallout
>> $1 = {c_links = {le = {le_next = 0xfffff800108e6470, le_prev =
>> 0xfffffe0000be6ea8}, sle = {sle_next = 0xfffff800108e6470}, tqe = {tqe_next =
>> 0xfffff800108e6470, tqe_prev = 0xfffffe0000be6ea8}}, c_time = 160957479343159,
>>   c_precision = 268435450, c_arg = 0xfffff80184602000, c_func =
>> 0xffffffff807481d0 <sleepq_timeout>, c_lock = 0x0, c_flags = 2, c_iflags = 272,
>> c_cpu = 6, c_exec_time = 160957506517070} [*]
>> (kgdb) p/x td->td_flags
>> $5 = 0x80000004
> What is bit 31 in your flags?  FreeBSD does not use that bit.

It's TDF_NOSWAP, a local addition.
We use it to prohibit full process swapout (I guess that means kernel stacks).

>> (kgdb) p td->td_sqqueue
>> $8 = 0
>> (kgdb) p td->td_sleepqueue
>> $9 = (struct sleepqueue *) 0x0
>> (kgdb) p td->td_wchan
>> $10 = (void *) 0xfffff802b990df38
>>
>>
>> Has anyone seen anything like this problem?
> Yes, but it was a very long time ago.  See r303426.

Yeah, we are based off r329000 plus a bunch of merges for various fixes.
One thing I forgot to mention is that it seems to happen only on VMware guests,
but maybe it's only because we have many more virtual test boxes than we have
physical ones.
One thing I suspected was that binuptime() could somehow jump backwards...

>> Any advice on how to diagnose it?
>>
>> Thanks!
>>
>> P.S.
>> c_exec_time is our addition: we set this field right before firing a callback
>> and we reset it to zero when a callout is (re-)scheduled.


-- 
Andriy Gapon


