Date:      Wed, 28 Oct 2015 15:21:38 -0700
From:      "Russell L. Carter" <rcarter@pinyon.org>
To:        Bryan Drewery <bdrewery@FreeBSD.org>, FreeBSD Ports ML <freebsd-ports@freebsd.org>
Subject:   Re: hung poudriere bulk recovery
Message-ID:  <56314A72.9020005@pinyon.org>
In-Reply-To: <563147BE.2070604@FreeBSD.org>
References:  <562A6185.5000305@pinyon.org> <563147BE.2070604@FreeBSD.org>

Hi Bryan,

On 10/28/15 15:10, Bryan Drewery wrote:
> On 10/23/2015 9:34 AM, Russell L. Carter wrote:
>>
>> Greetings,
>>
>> Recently my nightly cron poudriere builds have been occasionally
>> hanging.  For instance, here's last night's, with apparently no
>> progress for over 10 hours:
>>
>> root@terpsichore> poudriere status
>> SET PORTS   JAIL            BUILD                STATUS         QUEUE BUILT FAIL SKIP IGNORE REMAIN TIME     LOGS
>> -   default 10-stable-amd64 2015-10-22_22h30m08s parallel_build   488    34    0    0      0    454 10:45:56 /ssd1/poudriere/data/logs/bulk/10-stable-amd64-default/2015-10-22_22h30m08s
>> root@terpsichore>
>>
>
> Also check 'poudriere status -b' to see per-builder status. Something
> may actually be doing something. Poudriere will time out builds after a
> long time. I forget the default but it may be up to 24 hours.

Good to know.  I will try that out, probably tomorrow morning.  The
last two nights' poudriere bulk builds have hung, but as I mentioned
before, when run from the console the exact same script succeeds
and poudriere shuts down cleanly.  poudriere jail -k seems to mostly
work OK for recovering.
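
A minimal sketch of that recovery, following Bryan's advice quoted
further down (TERM the main process, then 'jail -k').  It assumes the
main poudriere process is the topmost bulk.sh, i.e. the one whose
parent is not itself bulk.sh.  The helper names, the 60 second wait
and the JAIL/TREE defaults (taken from the status output above) are
illustrative, not anything poudriere provides.

```sh
#!/bin/sh
# Sketch: stop a hung bulk run, then clean up leftover mounts.
# JAIL and TREE default to the names in the status output; no -z is
# passed because the SET column there is "-".
JAIL="${JAIL:-10-stable-amd64}"
TREE="${TREE:-default}"

# Read "PID PPID COMMAND..." lines on stdin and print every bulk.sh
# PID whose parent is not another bulk.sh (assumed to be the main one).
main_bulk_pid() {
    awk '
        /poudriere\/bulk\.sh/ { pid[$1] = $2 }
        END { for (p in pid) if (!(pid[p] in pid)) print p }
    '
}

recover() {
    pid=$(ps -axo pid,ppid,command | main_bulk_pid | head -n 1)
    if [ -n "$pid" ]; then
        kill -TERM "$pid"
        # Give poudriere up to 60 s (arbitrary) to clean up its children.
        i=0
        while kill -0 "$pid" 2>/dev/null && [ "$i" -lt 60 ]; do
            sleep 1
            i=$((i + 1))
        done
    fi
    # Unmount anything left over from the previous build.
    poudriere jail -j "$JAIL" -p "$TREE" -k
}

# Only act when invoked with "run", so the file can be sourced safely.
if [ "${1:-}" = "run" ]; then
    recover
fi
```

Saved under a hypothetical name such as recover-bulk.sh, it would be
invoked as 'sh recover-bulk.sh run'.  Whether TERM alone is enough
when the run is truly deadlocked is not something this sketch can
answer.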

This just started last week after nearly a year of flawless cron'd jobs.
(poudriere was flawless; ports are another matter.)

>> htop now shows no significant activity for the specified 3 builders:
>>
>> root@terpsichore> ps xa | grep poud
>> 72482  -  Is       0:00.01 /bin/sh /root/poudriere/run-poudriere-bulk
>> 73202  -  S        0:04.24 sh -e /usr/local/share/poudriere/bulk.sh -f /root/poudriere/ports -j 10-stable-amd64
>> 73347  -  S        1:55.38 sh -e /usr/local/share/poudriere/bulk.sh -f /root/poudriere/ports -j 10-stable-amd64
>> 73352  -  I        0:00.08 sh -e /usr/local/share/poudriere/bulk.sh -f /root/poudriere/ports -j 10-stable-amd64
>>   6119  1  S+       0:00.00 grep poud
>> root@terpsichore>
>>
>> If I reboot, so that the tmp zfs filesystems are unmounted, and
>> manually rerun the exact same script as the previous cron'd, hung
>> instance, poudriere has (so far) run to completion.
>
> Please record 'procstat -kka' before rebooting in case this is some kind
> of deadlock.

Will do.  Many thanks for the suggestions.  It sure smells like luser
fail but I don't see it yet...
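
One way to make sure the state is captured before any cleanup or
reboot is a small helper along these lines.  This is a sketch only:
the function names and the OUT_BASE directory are hypothetical, and
the only diagnostics it relies on are the two commands Bryan named
plus a plain process listing.

```sh
#!/bin/sh
# Sketch: save per-builder status and kernel stacks from a hung run.
# OUT_BASE is a hypothetical location; any writable directory will do.
OUT_BASE="${OUT_BASE:-/var/tmp/poudriere-hang}"

# Run a command if it is installed, saving stdout and stderr to a file.
# A failing or missing command is noted in the file instead of aborting.
save_output() {
    outfile=$1
    shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >"$outfile" 2>&1 || echo "exit status $?" >>"$outfile"
    else
        echo "$1: not found" >"$outfile"
    fi
}

# Create a timestamped directory, fill it, and print its path.
capture_hang_state() {
    dir="$OUT_BASE/$(date +%Y-%m-%d_%Hh%Mm%Ss)"
    mkdir -p "$dir" || return 1
    save_output "$dir/status-b.txt" poudriere status -b
    save_output "$dir/procstat-kka.txt" procstat -kka
    save_output "$dir/ps.txt" ps -axww
    echo "$dir"
}
```

Calling capture_hang_state by hand once the hang is noticed, and
before running 'poudriere jail -k' or rebooting, leaves the output in
one place to attach to a follow-up.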

Best,
Russell


>>
>> I'm not sure how to debug this, but in the interim, I'm very curious
>> how I can stop the hung bulk run, and either restart it, or clean up
>> the various mounted zfs filesystems and manually restart from the
>> beginning w/o rebooting.  Studying the man page, it's not clear at all
>> the Right Way to do this, so any pointers here would be appreciated.
>
> Kill -TERM the main poudriere process. It will clean up children.
>
> Beyond that you can 'poudriere jail -j NAME -p TREE -z SET -k' to clean
> up any mounts leftover from a previous build.
>
> Adding a 'poudriere kill' command is on the todo list.
>
>>
>> I'm leaving the system untouched for now so that I can try out any
>> suggestions for cleanup and restart.
>
>
>


