Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 19 Aug 2016 16:52:00 -0500
From:      Karl Denninger <karl@denninger.net>
To:        freebsd-fs@freebsd.org
Subject:   Re: ZFS ARC under memory pressure
Message-ID:  <05ba785a-c86f-1ec8-fcf3-71d22551f4f3@denninger.net>
In-Reply-To: <20160819213446.GT8192@zxy.spb.ru>
References:  <20160816193416.GM8192@zxy.spb.ru> <8dbf2a3a-da64-f7f8-5463-bfa23462446e@FreeBSD.org> <20160818202657.GS8192@zxy.spb.ru> <c3bc6c5a-961c-e3a4-2302-f0f7417bc34f@denninger.net> <20160819201840.GA12519@zxy.spb.ru> <bcb14d0b-bd6d-cb93-ea71-3656cfce8b3b@denninger.net> <20160819213446.GT8192@zxy.spb.ru>

next in thread | previous in thread | raw e-mail | index | archive | help
This is a cryptographically signed message in MIME format.

--------------ms040808020408080903000805
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable



On 8/19/2016 16:34, Slawa Olhovchenkov wrote:
> On Fri, Aug 19, 2016 at 03:38:55PM -0500, Karl Denninger wrote:
>
>> On 8/19/2016 15:18, Slawa Olhovchenkov wrote:
>>> On Thu, Aug 18, 2016 at 03:31:26PM -0500, Karl Denninger wrote:
>>>
>>>> On 8/18/2016 15:26, Slawa Olhovchenkov wrote:
>>>>> On Thu, Aug 18, 2016 at 11:00:28PM +0300, Andriy Gapon wrote:
>>>>>
>>>>>> On 16/08/2016 22:34, Slawa Olhovchenkov wrote:
>>>>>>> I see issuses with ZFS ARC inder memory pressure.
>>>>>>> ZFS ARC size can be dramaticaly reduced, up to arc_min.
>>>>>>>
>>>>>>> As I see memory pressure event cause call arc_lowmem and set need=
free:
>>>>>>>
>>>>>>> arc.c:arc_lowmem
>>>>>>>
>>>>>>>         needfree =3D btoc(arc_c >> arc_shrink_shift);
>>>>>>>
>>>>>>> After this, arc_available_memory return negative vaules (PAGESIZE=
 *
>>>>>>> (-needfree)) until needfree is zero. Independent how too much mem=
ory
>>>>>>> freed. needfree set to 0 in arc_reclaim_thread(), when arc_size <=
=3D
>>>>>>> arc_c. Until arc_size don't drop below arc_c (arc_c deceased at e=
very
>>>>>>> loop interation).
>>>>>>>
>>>>>>> arc_c droped to minimum value if arc_size fast enough droped.
>>>>>>>
>>>>>>> No control current to initial memory allocation.
>>>>>>>
>>>>>>> As result, I can see needless arc reclaim, from 10x to 100x times=
=2E
>>>>>>>
>>>>>>> Can some one check me and comment this?
>>>>>> You might have found a real problem here, but I am short of time r=
ight now to
>>>>>> properly analyze the issue.  I think that on illumos 'needfree' is=
 a variable
>>>>>> that's managed by the virtual memory system and it is akin to our
>>>>>> vm_pageout_deficit.  But during the porting it became an artificia=
l value and
>>>>>> its handling might be sub-optimal.
>>>>> As I see, totaly not optimal.
>>>>> I am create some patch for sub-optimal handling and now test it.
>>>>> _______________________________________________
>>>>> freebsd-fs at freebsd.org mailing list
>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
>>>>> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd=
=2Eorg"
>>>> You might want to look at the code contained in here:
>>>>
>>>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D187594
>>> In may case arc.c issuse cused by revision r286625 in HEAD (and
>>> r288562 in STABLE) -- all in 2015, not touch in 2014.
>>>
>>>> There are some ugly interactions with the VM system you can run into=
 if
>>>> you're not careful; I've chased this issue before and while I haven'=
t
>>>> yet done the work to integrate it into 11.x (and the underlying code=

>>>> *has* changed since the 10.x patches I developed) if you wind up dri=
ving
>>>> the VM system to evict pages to swap rather than pare back ARC you'r=
e
>>>> probably making the wrong choice.
>>>>
>>>> In addition UMA can come into the picture too and (at least previous=
ly)
>>>> was a severe contributor to pathological behavior.
>>> I am only do less aggresive (and more controlled) shrink of ARC size.=

>>> Now ARC just collapsed.
>>>
>>> Pointed PR is realy BIG. I am can't read and understund all of this.
>>> r286625 change behaivor of interaction between ARC and VM.
>>> You problem still exist? Can you explain (in list)?
>>>
>> Essentially ZFS is a "bolt-on" and unlike UFS which uses the unified
>> buffer cache (which the VM system manages) ZFS does not.  ARC is
>> allocated out of kernel memory and (by default) also uses UMA; the VM
>> system is not involved in its management.
>>
>> When the VM system gets constrained (low memory) it thus cannot tell t=
he
>> ARC to pare back.  So when the VM system gets low on RAM it will start=

> Currently VM generate event and ARC listen for this event, handle it
> by arc.c:arc_lowmem().
>
>> to page.  The problem with this is that if the VM system is low on RAM=

>> because the ARC is consuming memory you do NOT want to page, you want =
to
>> evict some of the ARC.
> Now by event `lowmem` ARC try to evict 1/128 of ARC.
>
>> Unfortunately the VM system has another interaction that causes troubl=
e
>> too.  The VM system will "demote" a page to inactive or cache status b=
ut
>> not actually free it.  It only starts to go through those pages and fr=
ee
>> them when the vm system wakes up, and that only happens when free spac=
e
>> gets low enough to trigger it.
>
>> Finally, there's another problem that comes into play; UMA.  Kernel
>> memory allocation is fairly expensive.  UMA grabs memory from the kern=
el
>> allocation system in big chunks and manages it, and by doing so gains =
a
>> pretty-significant performance boost.  But this means that you can hav=
e
>> large amounts of RAM that are allocated, not in use, and yet the VM
>> system cannot reclaim them on its own.  The ZFS code has to reap those=

>> caches, but reaping them is a moderately expensive operation too, thus=

>> you don't want to do it unnecessarily.
> Not sure, but some code in ZFS may be handle this.
> arc.c:arc_kmem_reap_now().
> Not sure.
>
>> I've not yet gone through the 11.x code to see what changed from 10.x;=

>> what I do know is that it is materially better-behaved than it used to=

>> be, in that prior to 11.x I would have (by now) pretty much been force=
d
>> into rolling that forward and testing it because the misbehavior in on=
e
>> of my production systems was severe enough to render it basically
>> unusable without the patch in that PR inline, with the most-serious
>> misbehavior being paging-induced stalls that could reach 10s of second=
s
>> or more in duration.
>>
>> 11.x hasn't exhibited the severe problems, unpatched, that 10.x was
>> known to do on my production systems -- but it is far less than great =
in
>> that it sure as heck does have UMA coherence issues.....
>>
>> ARC Size:                               38.58%  8.61    GiB
>>         Target Size: (Adaptive)         70.33%  15.70   GiB
>>         Min Size (Hard Limit):          12.50%  2.79    GiB
>>         Max Size (High Water):          8:1     22.32   GiB
>>
>> I have 20GB out in kernel memory on this machine right now but only 8.=
6
>> of it in ARC; the rest is (mostly) sitting in UMA allocated-but-unused=

>> -- so despite the belief expressed by some that the 11.x code is
>> "better" at reaping UMA I'm sure not seeing it here.
> I see.
> In my case:
>
> ARC Size:                               79.65%  98.48   GiB
>         Target Size: (Adaptive)         79.60%  98.42   GiB
>         Min Size (Hard Limit):          12.50%  15.46   GiB
>         Max Size (High Water):          8:1     123.64  GiB
>
> System Memory:
>
>         2.27%   2.83    GiB Active,     9.58%   11.94   GiB Inact
>         86.34%  107.62  GiB Wired,      0.00%   0 Cache
>         1.80%   2.25    GiB Free,       0.00%   0 Gap
>
>         Real Installed:                         128.00  GiB
>         Real Available:                 99.96%  127.95  GiB
>         Real Managed:                   97.41%  124.64  GiB
>
>         Logical Total:                          128.00  GiB
>         Logical Used:                   88.92%  113.81  GiB
>         Logical Free:                   11.08%  14.19   GiB
>
> Kernel Memory:                                  758.25  MiB
>         Data:                           97.81%  741.61  MiB
>         Text:                           2.19%   16.64   MiB
>
> Kernel Memory Map:                              124.64  GiB
>         Size:                           81.84%  102.01  GiB
>         Free:                           18.16%  22.63   GiB
>
> Mem: 2895M Active, 12G Inact, 108G Wired, 528K Buf, 2303M Free
> ARC: 98G Total, 89G MFU, 9535M MRU, 35M Anon, 126M Header, 404M Other
> Swap: 32G Total, 394M Used, 32G Free, 1% Inuse
>
> Is this 12G Inactive as 'UMA allocated-but-unused'?
> This is also may be freed but not reclaimed network bufs.
>
>> I'll get around to rolling forward and modifying that PR since that
>> particular bit of jackassery with UMA is a definite performance
>> problem.  I suspect a big part of what you're seeing lies there as
>> well.  When I do get that code done and tested I suspect it may solve
>> your problems as well.
> No. May problem is completly different: under memory pressure, after ar=
c_lowmem()
> set needfree to non-zero arc_reclaim_thread() start to shrink ARC. But
> arc_reclaim_thread (in FreeBSD case) don't correctly control this proce=
ss
> and shrink stoped in random time (when after next iteration arc_size <=3D=
 arc_c),
> mostly after drop to Min Size (Hard Limit).
>
> I am just resore control of shrink process.
Not quite due to the UMA issue, among other things.  There's also a
potential "stall" issue that can arise also having to do with dirty_max
sizing, especially if you are using rotating media.  The PR patch scaled
that back dynamically as well under memory pressure and eliminated that
issue as well.

I won't have time to look at this for at least another week on my test
machine as I'm unfortunately buried with unrelated work at present, but
I should be able to put some effort into this within the next couple
weeks and see if I can quickly roll forward the important parts of the
previous PR patch.

I think you'll find that it stops the behavior you're seeing - I'm just
pointing out that this was more-complex internally than it first
appeared in the 10.x branch and I have no reason to believe the
interactions that lead to bad behavior are not still in play given what
you're describing for symptoms.

--=20
Karl Denninger
karl@denninger.net <mailto:karl@denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/

--------------ms040808020408080903000805
Content-Type: application/pkcs7-signature; name="smime.p7s"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="smime.p7s"
Content-Description: S/MIME Cryptographic Signature

MIAGCSqGSIb3DQEHAqCAMIACAQExDzANBglghkgBZQMEAgMFADCABgkqhkiG9w0BBwEAAKCC
Bl8wggZbMIIEQ6ADAgECAgEpMA0GCSqGSIb3DQEBCwUAMIGQMQswCQYDVQQGEwJVUzEQMA4G
A1UECBMHRmxvcmlkYTESMBAGA1UEBxMJTmljZXZpbGxlMRkwFwYDVQQKExBDdWRhIFN5c3Rl
bXMgTExDMRwwGgYDVQQDExNDdWRhIFN5c3RlbXMgTExDIENBMSIwIAYJKoZIhvcNAQkBFhND
dWRhIFN5c3RlbXMgTExDIENBMB4XDTE1MDQyMTAyMjE1OVoXDTIwMDQxOTAyMjE1OVowWjEL
MAkGA1UEBhMCVVMxEDAOBgNVBAgTB0Zsb3JpZGExGTAXBgNVBAoTEEN1ZGEgU3lzdGVtcyBM
TEMxHjAcBgNVBAMTFUthcmwgRGVubmluZ2VyIChPQ1NQKTCCAiIwDQYJKoZIhvcNAQEBBQAD
ggIPADCCAgoCggIBALmEWPhAdphrWd4K5VTvE5pxL3blRQPyGF3ApjUjgtavqU1Y8pbI3Byg
XDj2/Uz9Si8XVj/kNbKEjkRh5SsNvx3Fc0oQ1uVjyCq7zC/kctF7yLzQbvWnU4grAPZ3IuAp
3/fFxIVaXpxEdKmyZAVDhk9az+IgHH43rdJRIMzxJ5vqQMb+n2EjadVqiGPbtG9aZEImlq7f
IYDTnKyToi23PAnkPwwT+q1IkI2DTvf2jzWrhLR5DTX0fUYC0nxlHWbjgpiapyJWtR7K2YQO
aevQb/3vN9gSojT2h+cBem7QIj6U69rEYcEDvPyCMXEV9VcXdcmW42LSRsPvZcBHFkWAJqMZ
Myiz4kumaP+s+cIDaXitR/szoqDKGSHM4CPAZV9Yh8asvxQL5uDxz5wvLPgS5yS8K/o7zDR5
vNkMCyfYQuR6PAJxVOk5Arqvj9lfP3JSVapwbr01CoWDBkpuJlKfpQIEeC/pcCBKknllbMYq
yHBO2TipLyO5Ocd1nhN/nOsO+C+j31lQHfOMRZaPQykXVPWG5BbhWT7ttX4vy5hOW6yJgeT/
o3apynlp1cEavkQRS8uJHoQszF6KIrQMID/JfySWvVQ4ksnfzwB2lRomrdrwnQ4eG/HBS+0l
eozwOJNDIBlAP+hLe8A5oWZgooIIK/SulUAsfI6Sgd8dTZTTYmlhAgMBAAGjgfQwgfEwNwYI
KwYBBQUHAQEEKzApMCcGCCsGAQUFBzABhhtodHRwOi8vY3VkYXN5c3RlbXMubmV0Ojg4ODgw
CQYDVR0TBAIwADARBglghkgBhvhCAQEEBAMCBaAwCwYDVR0PBAQDAgXgMCwGCWCGSAGG+EIB
DQQfFh1PcGVuU1NMIEdlbmVyYXRlZCBDZXJ0aWZpY2F0ZTAdBgNVHQ4EFgQUxRyULenJaFwX
RtT79aNmIB/u5VkwHwYDVR0jBBgwFoAUJHGbnYV9/N3dvbDKkpQDofrTbTUwHQYDVR0RBBYw
FIESa2FybEBkZW5uaW5nZXIubmV0MA0GCSqGSIb3DQEBCwUAA4ICAQBPf3cYtmKowmGIYsm6
eBinJu7QVWvxi1vqnBz3KE+HapqoIZS8/PolB/hwiY0UAE1RsjBJ7yEjihVRwummSBvkoOyf
G30uPn4yg4vbJkR9lTz8d21fPshWETa6DBh2jx2Qf13LZpr3Pj2fTtlu6xMYKzg7cSDgd2bO
sJGH/rcvva9Spkx5Vfq0RyOrYph9boshRN3D4tbWgBAcX9POdXCVfJONDxhfBuPHsJ6vEmPb
An+XL5Yl26XYFPiODQ+Qbk44Ot1kt9s7oS3dVUrh92Qv0G3J3DF+Vt6C15nED+f+bk4gScu+
JHT7RjEmfa18GT8DcT//D1zEke1Ymhb41JH+GyZchDRWtjxsS5OBFMzrju7d264zJUFtX7iJ
3xvpKN7VcZKNtB6dLShj3v/XDsQVQWXmR/1YKWZ93C3LpRs2Y5nYdn6gEOpL/WfQFThtfnat
HNc7fNs5vjotaYpBl5H8+VCautKbGOs219uQbhGZLYTv6okuKcY8W+4EJEtK0xB08vqr9Jd0
FS9MGjQE++GWo+5eQxFt6nUENHbVYnsr6bYPQsZH0CRNycgTG9MwY/UIXOf4W034UpR82TBG
1LiMsYfb8ahQJhs3wdf1nzipIjRwoZKT1vGXh/cj3gwSr64GfenURBxaFZA5O1acOZUjPrRT
n3ci4McYW/0WVVA3lDGCBRMwggUPAgEBMIGWMIGQMQswCQYDVQQGEwJVUzEQMA4GA1UECBMH
RmxvcmlkYTESMBAGA1UEBxMJTmljZXZpbGxlMRkwFwYDVQQKExBDdWRhIFN5c3RlbXMgTExD
MRwwGgYDVQQDExNDdWRhIFN5c3RlbXMgTExDIENBMSIwIAYJKoZIhvcNAQkBFhNDdWRhIFN5
c3RlbXMgTExDIENBAgEpMA0GCWCGSAFlAwQCAwUAoIICTTAYBgkqhkiG9w0BCQMxCwYJKoZI
hvcNAQcBMBwGCSqGSIb3DQEJBTEPFw0xNjA4MTkyMTUyMDBaME8GCSqGSIb3DQEJBDFCBEAK
hJi5/8ptyPvenRhWie/BSME8lhs9BQnHdC6flidXNcBCWBhTvA0NrlqjIYn/ORlwXesJRByf
t14fEPqQtrVaMGwGCSqGSIb3DQEJDzFfMF0wCwYJYIZIAWUDBAEqMAsGCWCGSAFlAwQBAjAK
BggqhkiG9w0DBzAOBggqhkiG9w0DAgICAIAwDQYIKoZIhvcNAwICAUAwBwYFKw4DAgcwDQYI
KoZIhvcNAwICASgwgacGCSsGAQQBgjcQBDGBmTCBljCBkDELMAkGA1UEBhMCVVMxEDAOBgNV
BAgTB0Zsb3JpZGExEjAQBgNVBAcTCU5pY2V2aWxsZTEZMBcGA1UEChMQQ3VkYSBTeXN0ZW1z
IExMQzEcMBoGA1UEAxMTQ3VkYSBTeXN0ZW1zIExMQyBDQTEiMCAGCSqGSIb3DQEJARYTQ3Vk
YSBTeXN0ZW1zIExMQyBDQQIBKTCBqQYLKoZIhvcNAQkQAgsxgZmggZYwgZAxCzAJBgNVBAYT
AlVTMRAwDgYDVQQIEwdGbG9yaWRhMRIwEAYDVQQHEwlOaWNldmlsbGUxGTAXBgNVBAoTEEN1
ZGEgU3lzdGVtcyBMTEMxHDAaBgNVBAMTE0N1ZGEgU3lzdGVtcyBMTEMgQ0ExIjAgBgkqhkiG
9w0BCQEWE0N1ZGEgU3lzdGVtcyBMTEMgQ0ECASkwDQYJKoZIhvcNAQEBBQAEggIAM2mjKv7n
smv9SiI6bPPW708oruljYXQpJPRsM0HD8/hYLn5TPsVysnWZwuZCUrNikEBrQI5qqMmpYt9n
o/DrVAhOiupZ2Jz8/oO7KJ+EEdMCABFdY9LRowdpJTHOhYUkaJ5D4YFg/EKP3a8RWGZ6av07
Iy4WZliVOVAV8147Pqxc/YJRxqEM225WV4riC2KkGgskNmYzB9M/nsNNTJiT0EhGxJIq/qfS
k5WwkSAMOpUj8M3dI6pOCyIDIqjSUc4wxoVa4UXrdgx5VvXIZCsaatC8USfjCi9j1UE0aACe
/CiPQFNIoesa+yMGszJ5jmHQAt1Wv/95nTQlfN6hEnZw015hGq6Wh3IPb4ajBVyy5TzEOiCV
qiql3Z8ccHGaBjQDlSqK+CM/8ApZSeXE/CpThaGRPdUyZBQ51XRLvYzqVnAAM2bPOAgrd2kw
ND8Ez7O2N3dpQJlc9pNKM7k7M0bfBSNp+bnjj3bLiiTFNA0fHnCKB2a1Eowucw3jDuVZ2jy3
OTUnBlWN48cE94fsMZ8hh2jYRZ7PHDLrveWUsCTkh8zPoN3rnWOrgw+SDBTREGp4rtJn7nQo
popEmhSAR5ZZ6txJ65XAhISwOcHaTJMTn5CitAAG03koJjHK244t64e9P2BiB0LqMrxehM5T
tOFRe/TzVhNIxUwq26xOfvP/pDYAAAAAAAA=
--------------ms040808020408080903000805--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?05ba785a-c86f-1ec8-fcf3-71d22551f4f3>