Date:      Thu, 16 Feb 2012 18:19:06 -0800
From:      Julian Elischer <julian@freebsd.org>
To:        davidxu@freebsd.org
Cc:        Alexander Kabaev <kan@freebsd.org>, threads@freebsd.org, David Xu <listlog2011@gmail.com>, FreeBSD Stable <freebsd-stable@freebsd.org>, Andriy Gapon <avg@freebsd.org>
Subject:   Re: pthread_cond_timedwait() broken in 9-stable? (from JAN 10)
Message-ID:  <4F3DB91A.2090806@freebsd.org>
In-Reply-To: <4F3DB3DB.2060603@gmail.com>
References:  <4F3C2671.3090808__7697.00510795719$1329343207$gmane$org@freebsd.org>	<4F3D3E2D.9090100@FreeBSD.org>	<4F3D6FDD.9050808@freebsd.org> <4F3D89CD.9050309@freebsd.org> <4F3DA27A.3090903@freebsd.org> <4F3DB3DB.2060603@gmail.com>

On 2/16/12 5:56 PM, David Xu wrote:
> On 2012/2/17 8:42, Julian Elischer wrote:
>> Adding David Xu for his thoughts since he rewrote the code in 
>> question in revision 213098
>>
>> On 2/16/12 2:57 PM, Julian Elischer wrote:
>>> On 2/16/12 1:06 PM, Julian Elischer wrote:
>>>> On 2/16/12 9:34 AM, Andriy Gapon wrote:
>>>>> on 15/02/2012 23:41 Julian Elischer said the following:
>>>>>> The program fio (an IO test in ports) uses pthreads
>>>>>>
>>>>>> the following code (from fio-2.0.3, but it's in earlier code too)
>>>>>> has suddenly started misbehaving.
>>>>>>
>>>>>>          clock_gettime(CLOCK_REALTIME,&t);
>>>>>>          t.tv_sec += seconds + 10;
>>>>>>
>>>>>>          pthread_mutex_lock(&mutex->lock);
>>>>>>
>>>>>>          while (!mutex->value && !ret) {
>>>>>>                  mutex->waiters++;
>>>>>>                  ret = pthread_cond_timedwait(&mutex->cond, &mutex->lock, &t);
>>>>>>                  mutex->waiters--;
>>>>>>          }
>>>>>>
>>>>>>          if (!ret) {
>>>>>>                  mutex->value--;
>>>>>>                  pthread_mutex_unlock(&mutex->lock);
>>>>>>          }
>>>>>>
>>>>>>
>>>>>> It turns out that 'ret' sometimes comes back instantly (on my 
>>>>>> machine) with a
>>>>>> value of 60 (ETIMEDOUT)
>>>>>> despite the fact that we set the timeout 10 seconds into the 
>>>>>> future.
>>>>>>
>>>>>> Has anyone else seen anything like this?
>>>>>> (and yes, the condition variable attributes have been set to use 
>>>>>> the REALTIME clock).
>>>>> But why?
>>>>>
>>>>> Just a hypothesis that maybe there is some issue with time 
>>>>> keeping on that system.
>>>>> How would that code work out for you with MONOTONIC?
>>>>
>>>> Jens Axboe (CC'd) tried both CLOCK_REALTIME and CLOCK_MONOTONIC, 
>>>> and they both had the same problem,
>>>> i.e. random early returns with ETIMEDOUT.
>>>>
>>>> I think we will try moving our machine forward to a newer -stable 
>>>> to see if it resolves.
>>> Kan upgraded the machine today to today's 9.x branch tip and the 
>>> problem still occurs.
>>> 8.x does not have this problem.
>>>
>>> I have not got a 9-RELEASE machine to test on.. so I can not tell 
>>> if this came in with the burst of stuff
>>> that came in after the 9.x branch was unfrozen after the release 
>>> of 9.0.
>>>
>>>
>>
> I am trying to reproduce the problem,  do you have complete sample 
> code to test ?

I'm still looking for the exact set,
but on my machine (4 CPUs) the program from ports sysutils/fio 
exhibits the problem when used with
kern.timecounter.hardware=TSC-low and with the following config file:

pu05 # cat config.fio

[global]
#clocksource=cpu
direct=1
rw=randread
bs=4096
fill_device=1
numjobs=16
iodepth=16
#ioengine=posixaio
#ioengine=psync
ioengine=psync
group_reporting
norandommap
time_based
runtime=60000
randrepeat=0

[file1]
filename=/dev/ada0

pu05 #
pu05 # fio config.fio
fio: this platform does not support process shared mutexes, forcing 
use of threads. Use the 'thread' option to get rid of this warning.
file1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=psync, iodepth=16
...
file1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=psync, iodepth=16
fio 2.0.3
Starting 15 threads and 1 process
fio: job startup hung? exiting.
fio: 5 jobs failed to start
Segmentation fault (core dumped)
pu05#


The reason 5 jobs failed to start is that the parent timed out on 
them immediately.
Apparently it didn't time out on 10 of them.


If I set the timecounter to ACPI-fast, it works as expected.
>
> Regards,
> David Xu
>
>



