Date:      Sun, 1 Sep 2013 20:41:47 +0300
From:      Mikolaj Golub <trociny@FreeBSD.org>
To:        Yamagi Burmeister <lists@yamagi.org>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: 9.2-RC1: LORs / Deadlock with SU+J on HAST in "memsync" mode
Message-ID:  <20130901174146.GA15654@gmail.com>
In-Reply-To: <20130826160125.3b62df57515c45be3c9b2723@yamagi.org>
References:  <20130819115101.ae9c0cf788f881dc4de464c5@yamagi.org> <20130822121341.0f27cb5e372d12bab8725654@yamagi.org> <20130825175616.GA3472@gmail.com> <20130826160125.3b62df57515c45be3c9b2723@yamagi.org>

Hi,

Yamagi, sorry for the delay. I can only work on this in my spare time,
which is mostly (some) weekends.

On Mon, Aug 26, 2013 at 04:01:25PM +0200, Yamagi Burmeister wrote:

> I'm sorry but the patch doesn't change anything. Processes accessing
> the UFS on top of HAST still deadlock within a couple of minutes.

Ok, my patch fixed an issue that might occur on secondary disconnect.
As you did not have disconnects according to your log, your issue is a
different one. The core you provided suggests this too.

Anyway, I have updated my patch, as the first version had some issues
(e.g. using a single flags variable to store different hio states was
not safe without locking, because two threads could modify the flags
simultaneously).
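
To illustrate that kind of race (a minimal sketch, not the actual
hastd code; the type, field and flag names are made up): the |= on a
shared flags word is a read-modify-write, so two threads completing
the same hio can overwrite each other's update unless the word is
protected by a lock, or each state is kept in its own field written by
a single thread.

#include <pthread.h>

/* Hypothetical flag values, for illustration only. */
#define HIO_LOCAL_DONE  0x01
#define HIO_REMOTE_DONE 0x02

struct hio_state {
    unsigned int    flags;
    pthread_mutex_t lock;
};

/*
 * Racy: flags |= flag is a load/modify/store sequence, so if the
 * local and remote completion paths run concurrently, one store can
 * overwrite the other and a "done" bit is lost.
 */
static void
hio_mark_done_racy(struct hio_state *st, unsigned int flag)
{
    st->flags |= flag;
}

/* Safe: serialize the read-modify-write with a lock. */
static void
hio_mark_done_locked(struct hio_state *st, unsigned int flag)
{
    pthread_mutex_lock(&st->lock);
    st->flags |= flag;
    pthread_mutex_unlock(&st->lock);
}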

Here is the updated version:

http://people.freebsd.org/~trociny/patches/hast.primary.c.memsync_secondary_disconnect.1.patch

Pawel, what do you think about this?

> trasz@ suggested that all "buf" may be exhausted, which would result in
> an IO deadlock, but at least increasing their number fourfold via
> "kern.nbuf" doesn't change anything.
> 
> > If it does not help, please, after the hang, get core images of the
> > worker processes (both primary and secondary) using gcore(1) and
> > provide them together with hastd binary and libraries it is linked
> > with (from `ldd /sbin/hastd' list). Note: core files might expose
> > sensitive information from your host; if this worries you, you can
> > send them to me privately.
> 
> No problem, it's a test setup without any production data. You can find
> a tar archive with the binary and libs (all with debug symbols) here:
> http://deponie.yamagi.org/freebsd/debug/lor_hast/hast_cores.tar.xz
> 
> I have two HAST providers, therefore two core dumps for each host:
> hast_deadlocked.core -> worker for the provider on which the processes
>                         deadlocked.
> hast_not_deadlocked.core -> worker for the other provider

Thanks. The state of the primary node at the moment the core was
generated:

253 requests in the local send queue, the other queues are empty, no
requests leaked. The thread states:

ggate_recv_thread:
  got hio 0x801cac880 from free queue
  lock res->hr_amp_lock to activemap_write_start()
  blocked on hast_activemap_flush->pwrite(activemap)
ggate_send_thread:
  got hio 0x801cac040 from done queue,
  waiting for res->hr_amp_lock to activemap_write_complete()
local_send_thread:
  got hio 0x801cabfa0 from local send queue,
  blocked on pwrite(hio data)
sync_thread:
  got hio 0x801cf9820 from free queue
  put hio to local send queue (read data from disk)
  waiting for read to complete
remote_recv_thread:
  waiting for a hio in remote recv queue
remote_send_thread:
  waiting for a hio in remote send queue
ctrl_thread:
  waiting for a control request
guard_thread:
  sleeping on sigtimedwait()

So a shortage of io buffers made two threads block on writing data and
metadata, and another one block waiting for the lock held by the
thread that was writing the metadata. This last thread was about to
return its request to UFS, potentially freeing a buffer, had it not
been waiting for the activemap lock.

As it involves flushing metadata to disk, Yamagi, you might try
reducing the metadata updates by tuning the extentsize and keepdirty
parameters. Note that you can change these parameters only by
recreating the providers. What you need to change depends on your
workload. If your applications mostly update some set of blocks, then
increasing the number of keepdirty extents so that they cover all the
updated blocks should reduce metadata updates. If your applications
mostly write (append) new data, then the only way to reduce metadata
flushes is to increase the extentsize. See hastctl(8) for a
description of the parameters and what they affect. You can monitor
the activemap updates with the `hastctl list' command and compare them
with the amount of writes.
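
To make the relation concrete with purely illustrative numbers: if the
hot set your applications keep rewriting is roughly 1 GB and the
extentsize is 2 MB, that set spans about 512 extents, so keepdirty
would need to be of that order for those extents to stay dirty in
memory instead of triggering repeated on-disk activemap updates.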

It is rather unfortunate that we flush the on-disk metadata while
holding the lock, so another thread that might only need to update the
in-memory map has to wait for the first thread to complete. It looks
like we can improve this by introducing an additional on-disk map
lock: when the in-memory map is updated and it turns out the on-disk
map needs updating too, the on-disk map lock is acquired and the
in-memory lock is released before flushing the map.

http://people.freebsd.org/~trociny/patches/hast.primary.c.activemap_flush_lock.1.patch
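
Roughly, the idea looks like the sketch below. This is only my
simplified illustration of the locking pattern, not the patch itself;
the names (mem_lock, disk_lock, map_update) are made up, and in hastd
the "flush needed" decision would come from activemap_write_start().

#include <pthread.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/*
 * Two-lock scheme: mem_lock protects only the in-memory map, disk_lock
 * serializes the slow on-disk flush.  Threads that just need to update
 * the in-memory map no longer wait behind a pwrite().
 */
static pthread_mutex_t mem_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t disk_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned char inmem_map[4096];   /* in-memory activemap (fake) */
static unsigned char ondisk_map[4096];  /* snapshot that goes to disk */
static int map_fd = -1;                 /* metadata file descriptor */

static void
map_update(size_t byte, unsigned char bits, bool flush_needed)
{
    pthread_mutex_lock(&mem_lock);
    inmem_map[byte] |= bits;            /* cheap in-memory update */
    if (!flush_needed) {
        pthread_mutex_unlock(&mem_lock);
        return;
    }
    /* Take the disk lock and snapshot the map, then drop mem_lock. */
    pthread_mutex_lock(&disk_lock);
    memcpy(ondisk_map, inmem_map, sizeof(ondisk_map));
    pthread_mutex_unlock(&mem_lock);
    /* Slow part: flush without blocking in-memory map users. */
    (void)pwrite(map_fd, ondisk_map, sizeof(ondisk_map), 0);
    pthread_mutex_unlock(&disk_lock);
}

In the core above, that would let ggate_send_thread do its
activemap_write_complete() while ggate_recv_thread is still blocked in
pwrite().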

Pawel, what do you think about this?

Yamagi, you might want to try this patch and see if it changes
anything for you. If it does not, could you please tell me a little
more about your workload so I can try to reproduce it.

Also, if you are going to try my suggestions, I would recommend this
patch too; it fixes the stalls I observed while trying to reproduce
your case: ggate_recv_thread() got stuck sleeping while waiting for a
free request because it never received the signal to wake up.

http://people.freebsd.org/~trociny/patches/hast.primary.c.cv_broadcast.1.patch
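
Judging from the patch name, it is about waking all waiters instead of
just one. For background, this is the generic shape of such a
wait/wakeup (my sketch, not the hastd code): the waiter re-checks its
condition in a loop under the mutex, and the producer broadcasts so
that no waiter is left asleep when several threads share the same
condition variable.

#include <pthread.h>

/* Generic free-queue sketch with a mutex and a condition variable. */
struct free_queue {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             nfree;      /* free request buffers available */
};

static void
free_queue_put(struct free_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->nfree++;
    pthread_cond_broadcast(&q->cond);   /* wake every waiter */
    pthread_mutex_unlock(&q->lock);
}

static void
free_queue_take(struct free_queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->nfree == 0)               /* always re-check */
        pthread_cond_wait(&q->cond, &q->lock);
    q->nfree--;
    pthread_mutex_unlock(&q->lock);
}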

> While all processes accessing the UFS filesystem on top of the provider
> deadlocked, HAST still seemed to transfer data to the secondary. At
> least the process generated CPU load, the switch LEDs were blinking,
> and the hard drive LEDs showed activity on both sides.

According to the core state, it might not be a deadlock but rather
starvation.

You could use the `hastctl list' command to monitor HAST io statistics
on both the primary and secondary nodes, and check whether (and how)
the counters change when the issue is observed. Also, you might be
interested in this patch, which adds the current queue sizes to the
`hastctl list' output and helps to see in real time what is going on
with a HAST node:

http://people.freebsd.org/~trociny/patches/hast.queue_stats.1.patch

The output looks like this:

    queues: local: 237, send: 0, recv: 3, done: 0, idle: 17

The local queue is for the thread that does local io. The send/recv
queues are for remote requests. The requests in the done queue are
those that have completed both locally and remotely and are waiting to
be returned to the idle queue. The idle queue keeps free (unused)
request buffers.

In the example above the bottleneck is on local io operations. If the
idle queue is empty, it means that HAST is overloaded.

I would like to commit this patch and the patch that fixes the lost
wakeups on the free queue, if Pawel has no objections.

-- 
Mikolaj Golub


