Date:      Sat, 16 Jan 2016 20:13:06 +0200
From:      Mykola Golub <trociny@FreeBSD.org>
To:        Shahin Hasanov <shahinhasanov@hotmail.com>
Cc:        FREEBSD_QUESTION <freebsd-questions@freebsd.org>
Subject:   Re: the switching time hastd from secondary to primary
Message-ID:  <20160116181305.GA2165@gmail.com>
In-Reply-To: <DUB127-W36479628640DB40F39E12BB6CC0@phx.gbl>
References:  <DUB127-W2563827245EC96990575DDB6CC0@phx.gbl> <CAA2O=b84TtRyjYgFL9v1e36nERE4QFQoePx9LLFi10bC-cXHSA@mail.gmail.com> <DUB127-W36479628640DB40F39E12BB6CC0@phx.gbl>

On Thu, Jan 14, 2016 at 02:23:46PM +0400, Shahin Hasanov wrote:

> In the /usr/local/sbin/ucarp_up.sh script (an extract is shown below),
> ucarp waits until it becomes primary. It takes about 20 sec, as
> written in
> http://www.freebsd.org/cgi/man.cgi?query=hast.conf&apropos=0&sektion=0&manpath=FreeBSD+10.2-RELEASE&arch=default&format=html
> for i in `jot 30`; do
>         pgrep -f "hastd: ${resource} \(secondary\)" >/dev/null 2>&1 || break
>         sleep 1
> done
> if pgrep -f "hastd: ${resource} \(secondary\)" >/dev/null 2>&1; then
>         logger -p local0.error -t hast "Secondary process for resource ${resource} is still running after 30 seconds."
>         exit 1
> fi

It would help to see the logs. But I guess you are hitting the
timeout in the thread that waits for incoming data from the primary.
This timeout is 2 * HAST_KEEPALIVE, and HAST_KEEPALIVE is hardcoded to
10 sec.
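
As a rough illustration of the arithmetic (a sketch only: the variable
names are mine, and the value 10 is copied by hand from the hardcoded
constant, not read from hastd), this is why the 30-iteration loop in
your script is long enough to outlast the timeout:

```sh
# Assumption: mirrors the hardcoded HAST_KEEPALIVE; not queried from hastd.
hast_keepalive=10

# The receiving thread gives up after two keepalive intervals.
recv_timeout=$((2 * hast_keepalive))

# The quoted script loops over `jot 30` with `sleep 1`, so about 30 seconds.
script_wait=30

if [ "$script_wait" -gt "$recv_timeout" ]; then
    echo "script wait (${script_wait}s) covers the receive timeout (${recv_timeout}s)"
else
    echo "script wait (${script_wait}s) is too short for the receive timeout (${recv_timeout}s)"
fi
```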

So right now it can be changed only by recompiling hastd. On the other
hand, hitting this timeout means that the connection was not closed
properly, so it is not what I would expect for a "planned" failover,
where the role is changed using `hastctl role` commands. This looks
more like a case of disaster recovery after a network partition, host
crash, hang, etc. In my opinion, waiting for 20 sec is not bad
compared with the possibility of a split-brain if the former primary
is still alive.

If you observe the 20 sec timeout when doing a "planned" failover, I
guess there is something wrong with the scripts that do the switching.
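
For comparison, a minimal sketch of what I mean by a "planned"
failover: the old primary is demoted with `hastctl role` first, which
should close the connection cleanly, and only then is the other node
promoted. The function names and the HASTCTL override are hypothetical
helpers of mine, not part of HAST or ucarp, and unmounting anything on
/dev/hast/<resource> is assumed to have been done before demoting.

```sh
# Hypothetical helpers, not part of HAST or ucarp.
# HASTCTL can be overridden (e.g. HASTCTL=echo for a dry run).
HASTCTL=${HASTCTL:-hastctl}

# Run on the node that is giving up the primary role.
# Assumes nothing is using /dev/hast/$1 any more.
hast_demote() {
    "$HASTCTL" role secondary "$1"
}

# Run on the node that is taking over, after the old primary was demoted.
hast_promote() {
    "$HASTCTL" role primary "$1"
}
```

With HASTCTL=echo, `hast_demote shared` only prints the arguments that
would be passed ("role secondary shared"); "shared" is a placeholder
resource name.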

-- 
Mykola Golub


