From owner-freebsd-fs@FreeBSD.ORG Sun Sep 1 17:41:52 2013
Date: Sun, 1 Sep 2013 20:41:47 +0300
From: Mikolaj Golub
To: Yamagi Burmeister
Cc: freebsd-fs@freebsd.org
Subject: Re: 9.2-RC1: LORs / Deadlock with SU+J on HAST in "memsync" mode
Message-ID: <20130901174146.GA15654@gmail.com>
In-Reply-To: <20130826160125.3b62df57515c45be3c9b2723@yamagi.org>

Hi, Yamagi,

sorry for the delay. I can only work on this in my spare time, which is
mostly (some) weekends.

On Mon, Aug 26, 2013 at 04:01:25PM +0200, Yamagi Burmeister wrote:
> I'm sorry, but the patch doesn't change anything. Processes accessing
> the UFS on top of HAST still deadlock within a couple of minutes.

OK, my patch fixed the issue that might occur on secondary disconnect.
As you did not have disconnects according to your log, your issue is
different. The core you provided suggests this too.

Anyway, I have updated my patch, as the first version had some issues
(e.g. the approach of using one flags variable to store different hio
states was not correct without locking, because the flags could be
changed by two threads simultaneously). Here is the updated version:

http://people.freebsd.org/~trociny/patches/hast.primary.c.memsync_secondary_disconnect.1.patch

Pawel, what do you think about this?
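To illustrate the kind of race I mean (the structure and names below
are made up for the example; this is not the actual hastd code): if two
threads do an unlocked read-modify-write such as `state |= bit' on the
same word, the two updates can interleave and one of them is lost.
Keeping such state changes under a per-request lock (or using atomic
operations) avoids that, roughly like this:

#include <pthread.h>

/* Hypothetical request structure, only for the example. */
struct req {
	pthread_mutex_t	 lock;	/* protects "state" */
	unsigned int	 state;	/* bit mask of completion states */
};

#define	REQ_LOCAL_DONE	0x01	/* local write completed */
#define	REQ_REMOTE_DONE	0x02	/* remote (secondary) write completed */

/* Called from both the local and the remote completion threads. */
static void
req_mark_done(struct req *req, unsigned int bit)
{
	pthread_mutex_lock(&req->lock);
	req->state |= bit;	/* the read-modify-write is now atomic */
	pthread_mutex_unlock(&req->lock);
}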
> trasz@ suggested that all "buf" may be exhausted, which would result
> in an IO deadlock, but at least increasing their number four times
> via "kern.nbuf" doesn't change anything.
>
> > If it does not help, please, after the hang, get core images of the
> > worker processes (both primary and secondary) using gcore(1) and
> > provide them together with the hastd binary and the libraries it is
> > linked with (from the `ldd /sbin/hastd' list). Note, core files
> > might expose secure information from your host; if this worries
> > you, you can send them to me privately.
>
> No problem, it's a test setup without any production data. You can find
> a tar archive with the binary and libs (all with debug symbols) here:
> http://deponie.yamagi.org/freebsd/debug/lor_hast/hast_cores.tar.xz
>
> I have two HAST providers, therefore two core dumps for each host:
> hast_deadlocked.core -> worker for the provider on which the processes
> deadlocked.
> hast_not_deadlocked.core -> worker for the other provider

Thanks. The state of the primary node at the moment the core was
generated: 253 requests in the local send queue, the other queues are
empty, no requests leaked. The state of the threads:

ggate_recv_thread:  got hio 0x801cac880 from the free queue, took
    res->hr_amp_lock for activemap_write_start(), blocked on
    hast_activemap_flush->pwrite(activemap)
ggate_send_thread:  got hio 0x801cac040 from the done queue, waiting
    for res->hr_amp_lock for activemap_write_complete()
local_send_thread:  got hio 0x801cabfa0 from the local send queue,
    blocked on pwrite(hio data)
sync_thread:        got hio 0x801cf9820 from the free queue, put it on
    the local send queue (to read data from disk), waiting for the read
    to complete
remote_recv_thread: waiting for a hio on the remote recv queue
remote_send_thread: waiting for a hio on the remote send queue
ctrl_thread:        waiting for a control request
guard_thread:       sleeping in sigtimedwait()

So a deficit of io buffers made two threads block on writing data and
metadata, and another thread block waiting for the lock held by the
thread that was writing metadata. That last thread was about to return
the request to UFS, potentially freeing a buffer, had it not been for
the lock on the activemap.

As this involves flushing metadata to disk, Yamagi, you might try
reducing metadata updates by tuning the extentsize and keepdirty
parameters. You can change these parameters only by recreating the
providers, though. What you need to change depends on your workload: if
your applications mostly update some set of blocks, then increasing the
number of keepdirty extents so that they cover all updated blocks
should reduce metadata updates; if your applications mostly write
(append) new data, then the only way to reduce metadata flushes is to
increase the extentsize. See hastctl(8) for a description of the
parameters and what they affect. You can monitor the activemap updates
with the `hastctl list' command and compare them with the amount of
writes.

It is rather unfortunate that we flush the metadata to disk under the
lock, so another thread that might only have to update the in-memory
map ends up waiting for the first thread to complete. It looks like we
can improve this by introducing an additional on-disk map lock: when
the in-memory map is updated and it is detected that the on-disk map
needs an update too, the on-disk map lock is acquired and the in-memory
lock is released before flushing the map.

http://people.freebsd.org/~trociny/patches/hast.primary.c.activemap_flush_lock.1.patch

Pawel, what do you think about this? Yamagi, you might want to try this
patch to see if it changes anything for you.
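To make reviewing easier, the idea in code is roughly this (the names
are hypothetical and the real patch is against hastd's primary.c, so
take this only as a sketch): the in-memory map is updated under its own
lock; if that update means the on-disk map has to be rewritten, the
on-disk map lock is taken before the in-memory lock is dropped, and the
slow pwrite() happens without blocking threads that only need the
in-memory update.

#include <sys/types.h>
#include <pthread.h>
#include <stdbool.h>

struct activemap;			/* opaque, as in hastd's activemap.h */

/*
 * From activemap.h (signature from memory, check the real header):
 * returns true when the on-disk map has to be updated as well.
 */
bool activemap_write_start(struct activemap *amp, off_t offset, off_t length);

/* Hypothetical stand-in for hast_activemap_flush() in primary.c. */
void flush_activemap_to_disk(struct activemap *amp);

static pthread_mutex_t amp_lock = PTHREAD_MUTEX_INITIALIZER;	  /* in-memory map */
static pthread_mutex_t amp_diskmap_lock = PTHREAD_MUTEX_INITIALIZER; /* on-disk map */

static void
write_start(struct activemap *amp, off_t offset, off_t length)
{
	bool need_flush;

	pthread_mutex_lock(&amp_lock);
	need_flush = activemap_write_start(amp, offset, length);
	if (need_flush) {
		/* Take the on-disk map lock before dropping amp_lock... */
		pthread_mutex_lock(&amp_diskmap_lock);
	}
	/* ...so other threads can keep updating the in-memory map. */
	pthread_mutex_unlock(&amp_lock);

	if (need_flush) {
		flush_activemap_to_disk(amp);	/* the slow pwrite() path */
		pthread_mutex_unlock(&amp_diskmap_lock);
	}
}

In hastd the locks would of course live in the resource structure (like
res->hr_amp_lock does now); the sketch only shows the locking order.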
If the patch does not help, could you please tell me a little more
about your workload so I can try to reproduce it. Also, if you are
going to try my suggestions, I would recommend this patch too; it fixes
the stalls I observed while trying to reproduce your case:
ggate_recv_thread() got stuck sleeping on taking a free request because
it had not received the signal to wake up.

http://people.freebsd.org/~trociny/patches/hast.primary.c.cv_broadcast.1.patch

> While all processes accessing the UFS filesystem on top of the provider
> deadlocked, HAST still seemed to transfer data to the secondary. At
> least the process generated CPU load, the switch LEDs were blinking
> and the harddrive LEDs showed activity on both sides.

According to the core state, it might not be a deadlock but rather
starvation. You could use the `hastctl list' command to monitor the
HAST io statistics on both the primary and the secondary node, and see
whether (and how) the counters change when the issue is observed.

Also, you might be interested in this patch, which adds the current
queue sizes to the `hastctl list' output and helps to see in real time
what is going on with a HAST node:

http://people.freebsd.org/~trociny/patches/hast.queue_stats.1.patch

The output looks like this:

  queues: local: 237, send: 0, recv: 3, done: 0, idle: 17

The local queue is for the thread that does local io. The send/recv
queues are for remote requests. The requests in the done queue are
those that have completed both locally and remotely and are waiting to
be returned to the idle queue. The idle queue keeps free (unused)
request buffers. In the example above the bottleneck is on local io
operations. If the idle queue is empty, it means that HAST is
overloaded.

I would like to commit this patch and the patch that fixes the lost
wakeups on the free queue, if Pawel does not have objections.

-- 
Mikolaj Golub