Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 18 Jul 2013 14:23:25 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Dan Thomas <godders@gmail.com>
Cc:        Kirk McKusick <mckusick@mckusick.com>, freebsd-fs@freebsd.org, Palle Girgensohn <girgen@freebsd.org>, Jeff Roberson <jroberson@jroberson.net>, Julian Akehurst <julian@pingpong.se>
Subject:   Re: leaking lots of unreferenced inodes (pg_xlog files?)
Message-ID:  <20130718112325.GZ5991@kib.kiev.ua>
In-Reply-To: <CAG8duQ2gB11=bXzJ6hFFzzNUSwofn3WGd4=EEPJuhyNr6UmjwQ@mail.gmail.com>
References:  <201307151932.r6FJWSxM087108@chez.mckusick.com> <51E5CD7A.2020109@FreeBSD.org> <20130717053431.GN5991@kib.kiev.ua> <CAG8duQ2gB11=bXzJ6hFFzzNUSwofn3WGd4=EEPJuhyNr6UmjwQ@mail.gmail.com>

next in thread | previous in thread | raw e-mail | index | archive | help

--MrbiU6dcJfOZ616B
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Jul 18, 2013 at 11:43:54AM +0100, Dan Thomas wrote:
> After a bit of experimentation, we've managed to nail down a
> reasonably short run that exhibits this leak. Postgres' verbose log
> output is linked below - whatever is causing the leak is in there
> somewhere, but alas I lack the necessary understanding of Postgres'
> internals to be able to pin it down any further.
>=20
> https://dl.dropboxusercontent.com/u/13916028/pg_leak_log.txt
This is of no use, at least for me.

>=20
> I've also got a 2.4M ktrace of this run, which is still pretty big,
> I'll admit. Unfortunately it's got some data in it that I'd rather not
> publish, but I'm happy to send it directly to anyone who might find it
> useful.
Such big ktrace is also unusable.

If you want me to look at the leak, use the patch which I sent earlier,
and add the flag to the vnodes which are likely to be leaked.  Then,
after 'show vnode', I would be able to see what is going on, I hope.

>=20
> Thanks,
>=20
> Dan
>=20
> On 17 July 2013 06:34, Konstantin Belousov <kostikbel@gmail.com> wrote:
> > On Wed, Jul 17, 2013 at 12:47:22AM +0200, Palle Girgensohn wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA1
> >>
> >> Kirk McKusick skrev:
> >> >> Date: Mon, 15 Jul 2013 10:51:10 +0100 Subject: Re: leaking lots of
> >> >> unreferenced inodes (pg_xlog files?) From: Dan Thomas
> >> >> <godders@gmail.com> To: Kirk McKusick <mckusick@mckusick.com> Cc:
> >> >> Palle Girgensohn <girgen@freebsd.org>, freebsd-fs@freebsd.org, Jeff
> >> >> Roberson <jroberson@jroberson.net>, Julian Akehurst
> >> >> <julian@pingpong.se> X-ASK-Info: Message Queued (2013/07/15
> >> >> 02:51:22) X-ASK-Info: Confirmed by User (2013/07/15 02:55:04)
> >> >>
> >> >> On 11 June 2013 01:17, Kirk McKusick <mckusick@mckusick.com>
> >> >> wrote:
> >> >>> OK, good to have it narrowed down. I will look to devise some
> >> >>> additional diagnostics that hopefully will help tease out the
> >> >>> bug. I'll hopefully get back to you soon.
> >> >> Hi,
> >> >>
> >> >> Is there any news on this issue? We're still running several
> >> >> servers that are exhibiting this problem (most recently, one that
> >> >> seems to be leaking around 10gb/hour), and it's getting to the
> >> >> point where we're looking at moving to a different OS until it's
> >> >> resolved.
> >> >>
> >> >> We have access to several production systems with this problem and
> >> >> (at least from time to time) will have systems with a significant
> >> >> leak on them that we can experiment with. Is there any way we can
> >> >> assist with tracking this down? Any diagnostics or testing that
> >> >> would be useful?
> >> >>
> >> >> Thanks, Dan
> >> >
> >> > Hi Dan (and Palle),
> >> >
> >> > Sorry for the long delay with no help / news. I have gotten
> >> > side-tracked on several projects and have had little time to try and
> >> > devise some tests that would help find the cause of the lost space.
> >> > It almost certainly is a one-line fix (a missing vput or vrele
> >> > probably in some error path), but finding where it goes is the hard
> >> > part :-)
> >> >
> >> > I have had little success in inserting code that tracks reference
> >> > counts (too many false positives). So, I am going to need some help
> >> > from you to narrow it down. My belief is that there is some set of
> >> > filesystem operations (system calls) that are leading to the
> >> > problem. Notably, a file is being created, data put into it, then the
> >> > file is deleted (either before or after being closed).  Somehow a
> >> > reference to that file is persisting despite there being no valid
> >> > reference to it. Hence the filesystem thinks it is still live and is
> >> > not deleting it. When you do the forcible unmount, these files get
> >> > cleared and the space shows back up.
> >> >
> >> > What I need to devise is a small test program doing the set of system
> >> > calls that cause this to happen. The way that I would like to try and
> >> > get it is to have you `ktrace -i' your application and then run your
> >> > application just long enough to create at least one of these lost
> >> > files. The goal is to minimize the amount of ktrace data through
> >> > which we need to sift.
> >> >
> >> > In preparation for doing this test you need to have a kernel compiled
> >> > with `option DIAGNOSTIC' or if you prefer, just add `#define
> >> > DIAGNOSTIC 1' to the top of sys/kern/vfs_subr.c. You will know you
> >> > have at least one offending file when you try to unmount the affected
> >> > filesystem and find it busy. Before doing the `umount -f', enable
> >> > busy printing using `sysctl debug.busyprt=3D1'. Then capture the
> >> > console output which will show the details of all the vnodes that had
> >> > to be forcibly flushed. Hopefully we will then be able to correlate
> >> > them back to the files (NAMI in the ktrace output) with which they
> >> > were associated. We may need to augment the NAMI data with the inode
> >> > number of the associated file to make the association with the
> >> > busyprt output. Anyway, once we have that, we can look at all the
> >> > system calls done on those files and create a small test program that
> >> > exhibits the problem. Given a small test program, Jeff or I can track
> >> > down the offending system call path and nail this pernicious bug once
> >> > and for all.
> >> >
> >> > Kirk McKusick
> >>
> >> Hi,
> >>
> >> I have run ktrace -i on pg_ctl (which forks off all the postgresql
> >> processes) and I got two "busy" files that where "lost" after a few
> >> hours. dmesg reveals this:
> >>
> >> vflush: busy vnode
> >> 0xfffffe067cdde960: tag ufs, type VREG
> >>     usecount 1, writecount 0, refcount 2 mountedhere 0
> >>     flags (VI(0x200))
> >>  VI_LOCKed    v_object 0xfffffe0335922000 ref 0 pages 0
> >>     lock type ufs: EXCL by thread 0xfffffe01600eb8e0 (pid 56723)
> >>       ino 11047146, on dev da2s1d
> >> vflush: busy vnode
> >> 0xfffffe039f35bb40: tag ufs, type VREG
> >>     usecount 1, writecount 0, refcount 3 mountedhere 0
> >>     flags (VI(0x200))
> >>  VI_LOCKed    v_object 0xfffffe03352701d0 ref 0 pages 0
> >>     lock type ufs: EXCL by thread 0xfffffe01600eb8e0 (pid 56723)
> >>       ino 11045961, on dev da2s1d
> >>
> >>
> >> I had to umount -f, so they where "lost".
> >>
> >> So, now I have 55 GB ktrace output... ;)  Is there anything I can do to
> >> filter it, or shall I compress it and put it on a web server for you to
> >> fetch as it is?
> >
> > I think that 55GB of ktrace is obviously useless.  The Kirk' idea was to
> > have an isolated test case that would only create the situation trigger=
ing
> > the leak, without irrelevant activity.  This indeed requires drilling d=
own
> > and isolating the file activities to get to the core of problem.
> >
> > FWIW, I and Peter Holm used the following alternative approach quite
> > successfully when tracking down other vnode reference leaks.  The appro=
ach
> > still requires some understanding of the specifics of the problematic
> > files to be useful, but not as much as isolated test.
> >
> > Basically, you take the patch below, and set the VV_DEBUGVREF flag for
> > the vnode that has characteristics as much specific for the leaked vnode
> > as possible.  The patch has example of setting the flag for all new NFS
> > vnodes.  You would probably want to do the same in vfs_vgetf(),
> > checking  e.g. for the partition where your leaks happen.  The limiting
> > of the vnodes for which the vref traces are accumulated is needed to
> > save the kernel memory.
> >
> > Then after the leak was observed, you just print the vnode with ddb
> > command 'show vnode addr' and send the output to developer.
> >
> > Index: sys/sys/vnode.h
> > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> > --- sys/sys/vnode.h     (revision 248723)
> > +++ sys/sys/vnode.h     (working copy)
> > @@ -94,6 +94,13 @@ struct vpollinfo {
> >
> >  #if defined(_KERNEL) || defined(_KVM_VNODE)
> >
> > +struct debug_ref {
> > +       TAILQ_ENTRY(debug_ref) link;
> > +       int val;
> > +       const char *op;
> > +       struct stack stack;
> > +};
> > +
> >  struct vnode {
> >         /*
> >          * Fields which define the identity of the vnode.  These fields=
 are
> > @@ -169,6 +176,7 @@ struct vnode {
> >         int     v_writecount;                   /* v ref count of write=
rs */
> >         u_int   v_hash;
> >         enum    vtype v_type;                   /* u vnode type */
> > +       TAILQ_HEAD(, debug_ref) v_debug_ref;
> >  };
> >
> >  #endif /* defined(_KERNEL) || defined(_KVM_VNODE) */
> > @@ -253,6 +261,7 @@ struct xvnode {
> >  #define        VV_DELETED      0x0400  /* should be removed */
> >  #define        VV_MD           0x0800  /* vnode backs the md device */
> >  #define        VV_FORCEINSMQ   0x1000  /* force the insmntque to succe=
ed */
> > +#define        VV_DEBUGVREF    0x2000
> >
> >  /*
> >   * Vnode attributes.  A field value of VNOVAL represents a field whose=
 value
> > Index: sys/kern/vfs_subr.c
> > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> > --- sys/kern/vfs_subr.c (revision 248723)
> > +++ sys/kern/vfs_subr.c (working copy)
> > @@ -71,6 +71,7 @@ __FBSDID("$FreeBSD$");
> >  #include <sys/sched.h>
> >  #include <sys/sleepqueue.h>
> >  #include <sys/smp.h>
> > +#include <sys/stack.h>
> >  #include <sys/stat.h>
> >  #include <sys/sysctl.h>
> >  #include <sys/syslog.h>
> > @@ -871,6 +872,23 @@ static struct kproc_desc vnlru_kp =3D {
> >  };
> >  SYSINIT(vnlru, SI_SUB_KTHREAD_UPDATE, SI_ORDER_FIRST, kproc_start,
> >      &vnlru_kp);
> > +
> > +MALLOC_DEFINE(M_RECORD_REF, "recordref", "recordref");
> > +static void
> > +v_record_ref(struct vnode *vp, int val, const char *op)
> > +{
> > +       struct debug_ref *r;
> > +
> > +       if ((vp->v_type !=3D VREG && vp->v_type !=3D VBAD) ||
> > +           (vp->v_vflag & VV_DEBUGVREF) =3D=3D 0)
> > +               return;
> > +       r =3D malloc(sizeof(struct debug_ref), M_RECORD_REF, M_NOWAIT |
> > +           M_USE_RESERVE);
> > +       r->val =3D val;
> > +       r->op =3D op;
> > +       stack_save(&r->stack);
> > +       TAILQ_INSERT_TAIL(&vp->v_debug_ref, r, link);
> > +}
> >
> >  /*
> >   * Routines having to do with the management of the vnode table.
> > @@ -1073,6 +1091,7 @@ alloc:
> >                         vp->v_vflag |=3D VV_NOKNOTE;
> >         }
> >         rangelock_init(&vp->v_rl);
> > +       TAILQ_INIT(&vp->v_debug_ref);
> >
> >         /*
> >          * For the filesystems which do not use vfs_hash_insert(),
> > @@ -1082,6 +1101,7 @@ alloc:
> >          */
> >         vp->v_hash =3D (uintptr_t)vp >> vnsz2log;
> >
> > +       TAILQ_INIT(&vp->v_debug_ref);
> >         *vpp =3D vp;
> >         return (0);
> >  }
> > @@ -2197,6 +2217,7 @@ vget(struct vnode *vp, int flags, struct thread *t
> >                         vinactive(vp, td);
> >                 vp->v_iflag &=3D ~VI_OWEINACT;
> >         }
> > +       v_record_ref(vp, 1, "vget");
> >         VI_UNLOCK(vp);
> >         return (0);
> >  }
> > @@ -2211,6 +2232,7 @@ vref(struct vnode *vp)
> >         CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
> >         VI_LOCK(vp);
> >         v_incr_usecount(vp);
> > +       v_record_ref(vp, 1, "vref");
> >         VI_UNLOCK(vp);
> >  }
> >
> > @@ -2253,6 +2275,7 @@ vputx(struct vnode *vp, int func)
> >                 KASSERT(func =3D=3D VPUTX_VRELE, ("vputx: wrong func"));
> >         CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
> >         VI_LOCK(vp);
> > +       v_record_ref(vp, -1, "vputx");
> >
> >         /* Skip this v_writecount check if we're going to panic below. =
*/
> >         VNASSERT(vp->v_writecount < vp->v_usecount || vp->v_usecount < =
1, vp,
> > @@ -2409,6 +2432,7 @@ void
> >  vdropl(struct vnode *vp)
> >  {
> >         struct bufobj *bo;
> > +       struct debug_ref *r, *r1;
> >         struct mount *mp;
> >         int active;
> >
> > @@ -2489,6 +2513,9 @@ vdropl(struct vnode *vp)
> >         lockdestroy(vp->v_vnlock);
> >         mtx_destroy(&vp->v_interlock);
> >         mtx_destroy(BO_MTX(bo));
> > +       TAILQ_FOREACH_SAFE(r, &vp->v_debug_ref, link, r1) {
> > +               free(r, M_RECORD_REF);
> > +       }
> >         uma_zfree(vnode_zone, vp);
> >  }
> >
> > @@ -2888,6 +2915,8 @@ vn_printf(struct vnode *vp, const char *fmt, ...)
> >         va_list ap;
> >         char buf[256], buf2[16];
> >         u_long flags;
> > +       int ref;
> > +       struct debug_ref *r;
> >
> >         va_start(ap, fmt);
> >         vprintf(fmt, ap);
> > @@ -2960,8 +2989,21 @@ vn_printf(struct vnode *vp, const char *fmt, ...)
> >                     vp->v_object->resident_page_count);
> >         printf("    ");
> >         lockmgr_printinfo(vp->v_vnlock);
> > -       if (vp->v_data !=3D NULL)
> > -               VOP_PRINT(vp);
> > +#if DDB
> > +       if (kdb_active) {
> > +               if (vp->v_data !=3D NULL)
> > +                       VOP_PRINT(vp);
> > +       }
> > +#endif
> > +
> > +       /* Getnewvnode() initial reference is not recorded due to VNON =
*/
> > +       ref =3D 1;
> > +       TAILQ_FOREACH(r, &vp->v_debug_ref, link) {
> > +               ref +=3D r->val;
> > +               printf("REF %d %s\n", ref, r->op);
> > +               stack_print(&r->stack);
> > +       }
> > +
> >  }
> >
> >  #ifdef DDB
> > Index: sys/fs/nfsclient/nfs_clport.c
> > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> > --- sys/fs/nfsclient/nfs_clport.c       (revision 248723)
> > +++ sys/fs/nfsclient/nfs_clport.c       (working copy)
> > @@ -273,6 +273,7 @@ nfscl_nget(struct mount *mntp, struct vnode *dvp,
> >                 /* vfs_hash_insert() vput()'s the losing vnode */
> >                 return (0);
> >         }
> > +       vp->v_vflag |=3D VV_DEBUGVREF;
> >         *npp =3D np;
> >
> >         return (0);
> > Index: sys/fs/nfsclient/nfs_clnode.c
> > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> > --- sys/fs/nfsclient/nfs_clnode.c       (revision 248723)
> > +++ sys/fs/nfsclient/nfs_clnode.c       (working copy)
> > @@ -179,6 +179,7 @@ ncl_nget(struct mount *mntp, u_int8_t *fhp, int fh
> >                 /* vfs_hash_insert() vput()'s the losing vnode */
> >                 return (0);
> >         }
> > +       vp->v_vflag |=3D VV_DEBUGVREF;
> >         *npp =3D np;
> >
> >         return (0);

--MrbiU6dcJfOZ616B
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.20 (FreeBSD)

iQIcBAEBAgAGBQJR59AsAAoJEJDCuSvBvK1BYsMQAJZj+18pwjeJQfyBoaK4Rif7
FK32mFljb85RxdGs9nX4Pmq91/00B6lv+8HmdlydYbOw4qiRm8x/hNverAr2GHSf
5ArGEJeTCwfXheG+kulivTo+sMrapeyR6XN5THIHjglBjrSBu8nUrAyNzyjaOFRq
2tLDn/NdibMJeBUKkVWMV3L7cmrIw3snF+kJc6f/1iDnBahOKPADxAHo/N1Exg2s
AC5wUuG+d5lrb/jFYaSoND1eDnWIVVu588GQuXrIo9N9GM9D+UVk1OM2pLEOISLM
7hL5mKagLD4wHpzW9FW6nlQjDcGQqJfkvYp+PqsVO6KGaVCk740N3rItDk9WGTfn
WNMGl0i9rDopvuaOBX/BwLrwdN/TQaXTHPdVdOHjDWyjqlmaG2r37AYZ2OOVfvhr
EH2UOAj1bcRfucGOrczxjSRFEI7honiOuw48RYZYNd4WUujSaA61vINdDXIhrYky
/+kTwvGpoBxNvE7pMmUg1fI0Ww/Kp1QsaObh/Kb9KmbOKDSNc8luVfLNbU+EnQaS
bfS9eUUkPvk86lgMLVoXRoXV707IF0r7SBosRgrc9IVc9xZjj4fNqDIvUiZrSoow
hQOPX6X9fZ1DSCngp9DuL4GZocv3levC/8wirpAydi5dTHs2XoP66P8Opn+SBzct
TRwTCYGSbZslIgTHuvMd
=NmSV
-----END PGP SIGNATURE-----

--MrbiU6dcJfOZ616B--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130718112325.GZ5991>