Date:      Mon, 23 Jul 2018 17:12:56 +0200
From:      Mark Martinec <Mark.Martinec+freebsd@ijs.si>
To:        stable@FreeBSD.org
Subject:   All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64
Message-ID:  <1a039af7758679ba1085934b4fb81b57@ijs.si>

After upgrading an older AMD host from FreeBSD 10.3 to 11.1-RELEASE-p11
(amd64), ZFS gradually eats up all memory, so the machine crashes every
few days once memory is completely exhausted (after swapping heavily
for a couple of hours).

This machine has only 4 GB of memory. After capping the ZFS ARC
at 1.8 GB the machine now stays up a bit longer, but within four days
all the memory is used up. The machine is lightly loaded: it runs
a bind resolver and a lightly used web server, and the ps output
does not show excessive memory use by any process.
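
For context, the ARC cap was applied the usual way, via the
vfs.zfs.arc_max loader tunable; the exact line below is a sketch
matching the 1.8 GB figure mentioned above:

```
# /boot/loader.conf -- cap the ZFS ARC (value matching the 1.8 GB above)
vfs.zfs.arc_max="1800M"
```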

During the last survival period I ran  vmstat -m  every second
and logged the results. What caught my eye was the 'solaris' entry,
which seems to account for all of the exhaustion.
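
The relevant columns can be pulled out with awk; a captured sample line
is used below so the snippet is self-contained (on the live system,
pipe  vmstat -m  in and loop):

```shell
# Extract the 'solaris' malloc-type InUse and MemUse columns from
# "vmstat -m" output. The sample line is one captured from this host.
sample='       solaris 3141552 225178K       - 12066929'
echo "$sample" | awk '$1 == "solaris" { print $2, $3 }'
# prints: 3141552 225178K

# Logged once per second in practice, e.g.:
#   while :; do printf '%s ' "$(date +%s)"; \
#       vmstat -m | awk '$1 == "solaris" { print $2, $3 }'; sleep 1; done
```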

The MemUse for the solaris entry starts modestly, e.g. after a few
hours of uptime:

$ vmstat -m
          Type InUse MemUse HighUse Requests  Size(s)
       solaris 3141552 225178K       - 12066929  16,32,64,128,256,512,1024,2048,4096,8192,16384,32768

... but this number keeps steadily growing.

After about four days, shortly before a crash, it grew to 2.5 GB,
which gets dangerously close to all the available memory:

       solaris 39359484 2652696K       - 234986296  16,32,64,128,256,512,1024,2048,4096,8192,16384,32768

Plotting the 'solaris' MemUse entry against wall-clock time in seconds
shows steady linear growth of about 25 MB per hour. At fine resolution
the growth proceeds as one small step roughly every 6 seconds.
All the steps are small, but not all are the same size.
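
As a sanity check on that rate, the first and last samples quoted above
(225178K a few hours in, 2652696K about four days later) work out to the
same figure; the timestamps below assume exactly 4 days (345600 s) between
the two samples:

```shell
# Back-of-the-envelope growth rate from two logged samples
# (timestamp in seconds, 'solaris' MemUse in KB).
printf '%s\n' '0 225178' '345600 2652696' | awk '
    NR == 1 { t0 = $1; m0 = $2 }
    { t1 = $1; m1 = $2 }
    END { printf "%.1f MB/hour\n", (m1 - m0) / 1024 / ((t1 - t0) / 3600) }'
# prints: 24.7 MB/hour
```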

The only thing (in my mind) that distinguishes this host from others
running 11.1 seems to be that one of the two ZFS pools is down because
its disk is broken. This is a scratch data pool, not otherwise in use.
The pool with the OS is healthy.

The syslog shows entries like the following periodically:

Jul 23 16:48:49 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354
Jul 23 16:49:09 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354
Jul 23 16:55:34 xxx ZFS: vdev state changed, pool_guid=15371508659919408885 vdev_guid=11732693005294113354

The 'zpool status -v' on this pool shows:

   pool: stuff
  state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
         replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
    see: http://illumos.org/msg/ZFS-8000-3C
   scan: none requested
config:

         NAME                    STATE     READ WRITE CKSUM
         stuff                   UNAVAIL      0     0     0
           11732693005294113354  UNAVAIL      0     0     0  was /dev/da2


The same machine, with this same broken pool, previously survived
indefinitely under FreeBSD 10.3.

So, could this broken pool be the reason for the memory depletion?
Are there any fixes for it? Any further tests I should run
before I get rid of this pool?

   Mark


