Date:      Sun, 11 Mar 2018 01:40:58 +0100
From:      Michael Gmelin <freebsd@grem.de>
To:        "O. Hartmann" <ohartmann@walstatt.org>
Cc:        Roman Bogorodskiy <novel@FreeBSD.org>, "Danilo G. Baio" <dbaio@FreeBSD.org>, "Rodney W. Grimes" <freebsd-rwg@pdx.rh.CN85.dnsmgr.net>, Trond Endrestøl <Trond.Endrestol@fagskolen.gjovik.no>, FreeBSD current <freebsd-current@freebsd.org>, Kurt Jaeger <lists@opsec.eu>
Subject:   Re: Strange ARC/Swap/CPU on yesterday's -CURRENT
Message-ID:  <CF9EA28E-965E-4BDA-9093-C13E70793338@grem.de>
In-Reply-To: <20180311004737.3441dbf9@thor.intern.walstatt.dynvpn.de>
References:  <20180306173455.oacyqlbib4sbafqd@ler-imac.lerctr.org> <201803061816.w26IGaW5050053@pdx.rh.CN85.dnsmgr.net> <20180306193645.vv3ogqrhauivf2tr@ler-imac.lerctr.org> <20180306221554.uyshbzbboai62rdf@dx240.localdomain> <20180307103911.GA72239@kloomba> <20180311004737.3441dbf9@thor.intern.walstatt.dynvpn.de>



> On 11. Mar 2018, at 00:47, O. Hartmann <ohartmann@walstatt.org> wrote:
> 
> On Wed, 7 Mar 2018 14:39:13 +0400,
> Roman Bogorodskiy <novel@FreeBSD.org> wrote:
> 
>>  Danilo G. Baio wrote:
>> 
>>>> On Tue, Mar 06, 2018 at 01:36:45PM -0600, Larry Rosenman wrote:
>>>> On Tue, Mar 06, 2018 at 10:16:36AM -0800, Rodney W. Grimes wrote:
>>>>>> On Tue, Mar 06, 2018 at 08:40:10AM -0800, Rodney W. Grimes wrote:
>>>>>>>> On Mon, 5 Mar 2018 14:39-0600, Larry Rosenman wrote:
>>>>>>>>
>>>>>>>>> Upgraded to:
>>>>>>>>>
>>>>>>>>> FreeBSD borg.lerctr.org 12.0-CURRENT FreeBSD 12.0-CURRENT #11 r330385:
>>>>>>>>> Sun Mar  4 12:48:52 CST 2018
>>>>>>>>> root@borg.lerctr.org:/usr/obj/usr/src/amd64.amd64/sys/VT-LER  amd64
>>>>>>>>> +1200060 1200060
>>>>>>>>>
>>>>>>>>> Yesterday, and I'm seeing really strange slowness, ARC use, and swap use
>>>>>>>>> and swapping.
>>>>>>>>>
>>>>>>>>> See http://www.lerctr.org/~ler/FreeBSD/Swapuse.png
>>>>>>>>
>>>>>>>> I see these symptoms on stable/11. One of my servers has 32 GiB of
>>>>>>>> RAM. After a reboot all is well. ARC starts to fill up, and I still
>>>>>>>> have more than half of the memory available for user processes.
>>>>>>>>
>>>>>>>> After running the periodic jobs at night, the amount of wired memory
>>>>>>>> goes sky high. /etc/periodic/weekly/310.locate is a particularly nasty
>>>>>>>> one.
>>>>>>>
>>>>>>> I would like to find out if this is the same person I have
>>>>>>> reporting this problem from another source, or if this is
>>>>>>> a confirmation of a bug I was helping someone else with.
>>>>>>>
>>>>>>> Have you been in contact with Michael Dexter about this
>>>>>>> issue, or any other forum/mailing list/etc?
>>>>>> Just IRC/Slack, with no response.
>>>>>>>
>>>>>>> If not, then we have at least 2 reports of this unbounded
>>>>>>> wired memory growth; if so, hopefully someone here can
>>>>>>> take you further in the debugging than we have been able
>>>>>>> to get.
>>>>>> What can I provide?  The system is still in this state as the full backup is
>>>>>> slow.
>>>>>
>>>>> One place to look is to see if this is the recently fixed g_bio leak:
>>>>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=222288
>>>>>
>>>>> vmstat -z | egrep 'ITEM|g_bio|UMA'
>>>>>
>>>>> would be a good first look.
>>>>>
>>>> borg.lerctr.org /home/ler $ vmstat -z | egrep 'ITEM|g_bio|UMA'
>>>> ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
>>>> UMA Kegs:               280,      0,     346,       5,     560,   0,   0
>>>> UMA Zones:             1928,      0,     363,       1,     577,   0,   0
>>>> UMA Slabs:              112,      0,25384098,  977762,102033225,   0,   0
>>>> UMA Hash:               256,      0,      59,      16,     105,   0,   0
>>>> g_bio:                  384,      0,      33,    1627,542482056,   0,   0
>>>> borg.lerctr.org /home/ler $
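[Editor's note: the leak signature in output like the above can be checked mechanically. Below is a minimal sketch, not from the thread, that parses `vmstat -z`-style lines and flags zones with a suspiciously large USED count; the 1,000,000 threshold is purely illustrative, and the sample rows are copied from the output quoted above.]

```python
# Sketch: parse `vmstat -z`-style rows and flag zones whose USED count
# looks unbounded. Threshold is illustrative, not a FreeBSD diagnostic.

SAMPLE = """\
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
UMA Slabs:              112,      0,25384098,  977762,102033225,   0,   0
g_bio:                  384,      0,      33,    1627,542482056,   0,   0
"""

def parse_zones(text):
    zones = {}
    for line in text.splitlines()[1:]:          # skip the header row
        name, _, fields = line.partition(":")
        size, limit, used, free, req, fail, sleep = (
            int(f) for f in fields.split(","))
        zones[name] = {"size": size, "used": used, "free": free, "req": req}
    return zones

def leak_suspects(zones, min_used=1_000_000):
    # A leaking zone keeps a huge USED count that never drains back to FREE.
    return [n for n, z in zones.items() if z["used"] >= min_used]

zones = parse_zones(SAMPLE)
print(zones["g_bio"]["used"])    # 33 -> the PR 222288 g_bio leak is absent here
print(leak_suspects(zones))      # ['UMA Slabs']
```

With g_bio at only 33 USED entries, this box does not show the PR 222288 signature, which matches the thread's reading of the output.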
>>>>>>>> Limiting the ARC to, say, 16 GiB, has no effect on the high amount of
>>>>>>>> wired memory. After a few more days, the kernel consumes virtually all
>>>>>>>> memory, forcing processes in and out of the swap device.
>>>>>>>
>>>>>>> Our experience as well.
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Rod Grimes
>>>>>>> rgrimes@freebsd.org
>>>>>> Larry Rosenman                     http://www.lerctr.org/~ler
>>>>>
>>>>> --
>>>>> Rod Grimes                                                 rgrimes@freebsd.org
>>>>
>>>> --
>>>> Larry Rosenman                     http://www.lerctr.org/~ler
>>>> Phone: +1 214-642-9640                 E-Mail: ler@lerctr.org
>>>> US Mail: 5708 Sabbia Drive, Round Rock, TX 78665-2106
>>>
>>>
>>> Hi.
>>>
>>> I noticed this behavior as well and changed vfs.zfs.arc_max to a smaller size.
>>>
>>> For me it started when I upgraded to 1200058; on this box I'm only using
>>> poudriere for build tests.
>>
>> I've noticed that as well.
>>
>> I have 16G of RAM and two disks: the first one is UFS with the system
>> installation, and the second one is ZFS, which I use to store media and
>> data files and for poudriere.
>>
>> I don't recall the exact date, but it started fairly recently. The system
>> would swap like crazy to the point where I cannot even ssh to it, and can
>> hardly log in through a tty: it might take 10-15 minutes to see a command
>> typed in the shell.
>>
>> I've updated loader.conf to have the following:
>>
>> vfs.zfs.arc_max="4G"
>> vfs.zfs.prefetch_disable="1"
>>
>> It fixed the problem, but introduced a new one. When I'm building stuff
>> with poudriere with ccache enabled, it takes hours to build even small
>> projects like curl or gnutls.
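[Editor's note: as a sanity check for tunables like the ones above, here is a small hypothetical helper, not part of loader.conf or any FreeBSD tool, that converts size strings such as "4G" to bytes. It assumes the usual power-of-two K/M/G/T suffixes and uses the 16G box from this thread as the example.]

```python
# Sketch: convert loader.conf size strings such as "4G" to bytes and
# report what fraction of installed RAM the ARC cap represents.
# Suffixes are assumed to be powers of two (K/M/G/T).

_SUFFIX = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}

def size_to_bytes(s):
    s = s.strip().strip('"').upper()
    if s[-1] in _SUFFIX:
        return int(s[:-1]) * _SUFFIX[s[-1]]
    return int(s)

ram = 16 * 2**30                       # the 16G box from this thread
arc_max = size_to_bytes('"4G"')        # vfs.zfs.arc_max="4G"
print(arc_max)                         # 4294967296
print(f"ARC capped at {arc_max / ram:.0%} of RAM")   # 25%
```

A 4G cap on a 16G box leaves three quarters of RAM for everything else, which is consistent with the workaround described above trading ARC hit rate (and thus build speed) for responsiveness.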
>>
>> For example, the current build:
>>
>> [10i386-default] [2018-03-07_07h44m45s] [parallel_build:] Queued: 3  Built: 1  Failed:
>> 0  Skipped: 0  Ignored: 0  Tobuild: 2   Time: 06:48:35 [02]: security/gnutls
>> | gnutls-3.5.18             build           (06:47:51)
>>
>> Almost 7 hours already and still going!
>>
>> gstat output looks like this:
>>
>> dT: 1.002s  w: 1.000s
>> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>>    0      0      0      0    0.0      0      0    0.0    0.0  da0
>>    0      1      0      0    0.0      1    128    0.7    0.1  ada0
>>    1    106    106    439   64.6      0      0    0.0   98.8  ada1
>>    0      1      0      0    0.0      1    128    0.7    0.1  ada0s1
>>    0      0      0      0    0.0      0      0    0.0    0.0  ada0s1a
>>    0      0      0      0    0.0      0      0    0.0    0.0  ada0s1b
>>    0      1      0      0    0.0      1    128    0.7    0.1  ada0s1d
>>
>> ada0 here is the UFS drive, and ada1 is ZFS.
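[Editor's note: picking the hot device out of a table like the one above can be automated. This is an illustrative sketch, assuming gstat's tabular layout with `%busy` and `Name` as the last two columns; the sample rows are taken from the output quoted above.]

```python
# Sketch: find the busiest device in gstat-style batch output,
# mirroring the reading of the table above (ada1 pegged near 100%).

SAMPLE = """\
L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
   0      1      0      0    0.0      1    128    0.7    0.1  ada0
   1    106    106    439   64.6      0      0    0.0   98.8  ada1
"""

def busiest(text):
    best = None
    for row in text.splitlines()[1:]:        # skip the header row
        cols = row.split()
        name, busy = cols[-1], float(cols[-2])
        if best is None or busy > best[1]:
            best = (name, busy)
    return best

print(busiest(SAMPLE))   # ('ada1', 98.8)
```

The 64.6 ms per read at only 439 kBps on ada1 is the tell here: the ZFS disk is saturated by small random reads while the UFS disk is idle, which fits the low-ARC, prefetch-disabled configuration described above.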
>>
>>> Regards.
>>> --
>>> Danilo G. Baio (dbaio)
>>
>>
>>
>> Roman Bogorodskiy
>
>
> This is from an APU (no ZFS; UFS on a small mSATA device). The APU (PC Engines)
> works as a firewall, router, and PBX:
>
> last pid:  9665;  load averages:  0.13,  0.13,  0.11    up 3+06:53:55  00:26:26
> 19 processes:  1 running, 18 sleeping
> CPU:  0.3% user,  0.0% nice,  0.2% system,  0.0% interrupt, 99.5% idle
> Mem: 27M Active, 6200K Inact, 83M Laundry, 185M Wired, 128K Buf, 675M Free
> Swap: 7808M Total, 2856K Used, 7805M Free
> [...]
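[Editor's note: to see how dominant wired memory is in a snapshot like the one above, here is a hypothetical parser for top's Mem line. The K/M multipliers are assumed binary (2^10, 2^20), and the line is copied from the output above.]

```python
# Sketch: turn top's Mem summary into bytes so the wired share stands out.

import re

MEM = "Mem: 27M Active, 6200K Inact, 83M Laundry, 185M Wired, 128K Buf, 675M Free"
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30}

def parse_mem(line):
    out = {}
    for amount, unit, label in re.findall(r"(\d+)([KMG]) (\w+)", line):
        out[label] = int(amount) * UNITS[unit]
    return out

mem = parse_mem(MEM)
total = sum(mem.values())
print(f"Wired: {mem['Wired'] / total:.0%} of accounted memory")   # 19%
```

About 19% wired on a box of this size is not alarming in itself; the complaint in the thread is that the machine nonetheless started using swap, which it never did before.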
>
> The APU is running CURRENT (FreeBSD 12.0-CURRENT #42 r330608: Wed Mar  7 16:55:59 CET
> 2018 amd64). Usually, the APU never(!) uses swap; now it has been swapping like hell
> for a couple of days, and I have to reboot it fairly often.
>
> Another box (16 GB RAM, ZFS, poudriere, the packaging box) is right now unresponsive:
> after hours of building packages, I tried to copy the repository from one location on
> the same ZFS volume to another. Usually this task takes a couple of minutes for ~2200
> ports; now it has taken 2 1/2 hours and the box got stuck. Ctrl-T on the console
> delivers:
>
> load: 0.00  cmd: make 91199 [pfault] 7239.56r 0.03u 0.04s 0% 740k
>
> No response from the box anymore.
>
>
> The problem of swapping like hell and performing slowly isn't an issue of just the
> past few days; it has been present for at least 1 1/2 weeks now, maybe more. Since I
> build ports fairly often, the time taken on that specific box has increased from 2 to
> 3 days for all ~2200 ports. The system has 16 GB of RAM and an IvyBridge 4-core XEON
> at 3.4 GHz, if this information matters. The box consumes swap really fast.
>
> Today is the first time the machine has become unresponsive (no ssh, no console login
> so far); it needed a cold start. The OS is CURRENT as well.
>

Any chance this is related to meltdown/spectre mitigation patches?

Best,
Michael



> Regards,
>
> O. Hartmann
>
>
> --
> O. Hartmann
>
> I object to the use or transfer of my data for advertising purposes
> or for market or opinion research (§ 28 Abs. 4 BDSG).



