Date:      Tue, 29 Jul 2003 23:11:30 +0200
From:      Poul-Henning Kamp <phk@phk.freebsd.dk>
To:        jeffr@freebsd.org
Cc:        current@freebsd.org
Subject:   HEADSUP: UMA not reentrant / possible memory leak
Message-ID:  <88569.1059513090@critter.freebsd.dk>

[I'm CC'ing current because this seems to have a significant negative
impact on -current kernel stability, and we can use some more data,
in particular on non-i386 SMP machines]

Thanks to Lukas Ertl and Bosko we have found a clear indication that
UMA is in fact not reentrant (enough).

The indication of this is that the g_bio zone does not return to
zero USED as it should.

The attached patch adds an atomic counter to GEOM which tracks the
number of bios actually in use and exports it as the sysctl variable
debug.ngbio.
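
As a user-space illustration of this kind of cross-check (a sketch, not
the patch itself): an independent atomic count is kept beside the
allocator, so it can be compared with the allocator's own statistics.
The names live_objects, counted_alloc, counted_free, live_count and
run_demo are hypothetical, and calloc/free merely stand in for
uma_zalloc/uma_zfree.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Independent count of live objects, kept outside the allocator. */
static atomic_int live_objects;

/* Hypothetical wrapper: calloc stands in for uma_zalloc(..., M_ZERO). */
void *
counted_alloc(size_t size)
{
	void *p;

	assert(size > 0);
	p = calloc(1, size);
	if (p != NULL)			/* count successful allocations only */
		atomic_fetch_add(&live_objects, 1);
	return (p);
}

/* Hypothetical wrapper: free stands in for uma_zfree. */
void
counted_free(void *p)
{
	if (p == NULL)			/* nothing was counted for NULL */
		return;
	free(p);
	atomic_fetch_sub(&live_objects, 1);
}

/* Current value of the independent counter. */
int
live_count(void)
{
	return (atomic_load(&live_objects));
}

/* Allocate and release a batch; returns the count left afterwards. */
int
run_demo(void)
{
	void *objs[100];
	int i;

	for (i = 0; i < 100; i++)
		objs[i] = counted_alloc(64);
	for (i = 0; i < 100; i++)
		counted_free(objs[i]);
	return (live_count());
}
```

If every allocation is matched by a free, the counter ends at zero, so
an allocator statistic which disagrees with it points at the
allocator's bookkeeping rather than at the callers.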

Here is a typical output from my SMP box:

bang# sh a.sh
g_bio:           144,        0,     35,     77,     4281
debug.ngbio: 0
10:58PM  up 36 secs, 1 user, load averages: 0.65, 0.20, 0.07
g_bio:           144,        0,     66,    102,     5917
debug.ngbio: 0
10:58PM  up 56 secs, 3 users, load averages: 0.46, 0.18, 0.07
g_bio:           144,        0,     69,     99,    12352
debug.ngbio: 0
10:59PM  up 1 min, 3 users, load averages: 0.56, 0.22, 0.09
g_bio:           144,        0,    185,    123,    20023
debug.ngbio: 0
10:59PM  up 2 mins, 3 users, load averages: 0.62, 0.25, 0.10
g_bio:           144,        0,    227,     81,    28259
debug.ngbio: 0
10:59PM  up 2 mins, 3 users, load averages: 0.64, 0.28, 0.11
g_bio:           144,        0,    222,     86,    32256
debug.ngbio: 0
11:00PM  up 2 mins, 3 users, load averages: 0.74, 0.33, 0.13

Notice that the USED column (the third number) fluctuates both up and
down while debug.ngbio stays at zero.  On other machines we can
reproduce negative USED counts.
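
The comparison made by eye above can be sketched in C as follows.  This
assumes the five numbers on a g_bio line are SIZE, LIMIT, USED, FREE and
REQUESTS, in that order; struct zone_stats, parse_zone_line and
used_discrepancy are hypothetical names, not part of any FreeBSD API.

```c
#include <assert.h>
#include <stdio.h>

/* One zone line; the column order is an assumption stated above. */
struct zone_stats {
	char name[32];
	int size, limit, used, free_items, requests;
};

/* Parse "name: size, limit, used, free, requests"; 0 on success. */
int
parse_zone_line(const char *line, struct zone_stats *zs)
{
	int n;

	assert(line != NULL && zs != NULL);
	n = sscanf(line, "%31[^:]: %d, %d, %d, %d, %d", zs->name,
	    &zs->size, &zs->limit, &zs->used, &zs->free_items,
	    &zs->requests);
	return (n == 6 ? 0 : -1);
}

/*
 * Difference between the zone's USED figure and an independent count
 * such as debug.ngbio.  Nonzero means the two disagree.
 */
int
used_discrepancy(const struct zone_stats *zs, int independent_count)
{
	return (zs->used - independent_count);
}
```

For the first sample line above, this reports a discrepancy of 35
against debug.ngbio: 0.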

As you can see in the patch, I have added a mutex around the zone
operations in order to see if that would solve the issue, but it
doesn't seem to make any difference at all.

I am unable to tell if it is just the UMA zone statistics which
are f**ked up, or if the "important" data structures in UMA are
also victims of this.  The machines which Lukas and Bosko work
on seem to die after some short period of time, and this could
indicate that this is not just statistics being b0rked.

We also see this problem on GCC 3.2.2 machines.

HELP!

Poul-Henning

Index: geom_io.c
===================================================================
RCS file: /home/ncvs/src/sys/geom/geom_io.c,v
retrieving revision 1.44
diff -u -r1.44 geom_io.c
--- geom_io.c	18 Jun 2003 10:33:09 -0000	1.44
+++ geom_io.c	29 Jul 2003 20:51:55 -0000
@@ -39,6 +39,7 @@
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
+#include <sys/sysctl.h>
 #include <sys/malloc.h>
 #include <sys/bio.h>
 
@@ -55,6 +56,12 @@
 static u_int pace;
 static uma_zone_t	biozone;
 
+struct mtx gbiomutex;
+static int ngbio;
+SYSCTL_INT(_debug, OID_AUTO, ngbio, CTLFLAG_RD,
+    &ngbio, 0, "");
+
+
 #include <machine/atomic.h>
 
 static void
@@ -116,15 +123,26 @@
 {
 	struct bio *bp;
 
+	mtx_lock(&gbiomutex);
 	bp = uma_zalloc(biozone, M_NOWAIT | M_ZERO);
+	mtx_unlock(&gbiomutex);
+	if (bp != NULL)
+		atomic_add_int(&ngbio, 1);
 	return (bp);
 }
 
 void
 g_destroy_bio(struct bio *bp)
 {
-
+	if (bp == NULL) {
+		printf("g_destroy_bio(NULL)");
+		Debugger("foo");
+		return;
+	}
+	mtx_lock(&gbiomutex);
 	uma_zfree(biozone, bp);
+	mtx_unlock(&gbiomutex);
+	atomic_add_int(&ngbio, -1);
 }
 
 struct bio *
@@ -132,8 +150,11 @@
 {
 	struct bio *bp2;
 
+	mtx_lock(&gbiomutex);
 	bp2 = uma_zalloc(biozone, M_NOWAIT | M_ZERO);
+	mtx_unlock(&gbiomutex);
 	if (bp2 != NULL) {
+		atomic_add_int(&ngbio, 1);
 		bp2->bio_parent = bp;
 		bp2->bio_cmd = bp->bio_cmd;
 		bp2->bio_length = bp->bio_length;
@@ -304,6 +325,7 @@
  
 	bzero(&mymutex, sizeof mymutex);
 	mtx_init(&mymutex, "g_xdown", MTX_DEF, 0);
+	mtx_init(&gbiomutex, "gbio", MTX_DEF, 0);
 
 	for(;;) {
 		g_bioq_lock(&g_bio_run_down);


-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.


