Date: Sat, 22 Sep 2001 01:17:44 -0400
From: Chuck Cranor
To: freebsd-stable@FreeBSD.ORG
Subject: why my 4.4-RELEASE kernel deadlocks
Message-ID: <20010922011744.B109536@chips.research.att.com>
Organization: AT&T Labs-Research

hi-

i have debugged the problem, here is the scoop:

Background:

every allocated FFS vnode has a private memory area allocated as
malloc type "FFS node"... each FFS node allocation takes a 512 byte
block out of the kernel memory allocator. this allocation occurs in
the function ffs_vget() (see file sys/ufs/ffs/ffs_vfsops.c, look for
MALLOC).

the kernel memory allocator limits the total memory allocated to each
malloc type to 102400K. if you try to allocate more than that, the
kernel may sleep waiting for memory of that type to become free.
you can see the current allocation with "vmstat -m" (note MemUse and
Limit):

Memory statistics by type                          Type  Kern
        Type  InUse MemUse HighUse  Limit Requests Limit Limit Size(s)
    FFS node    334   167K    167K102400K     1170    0     0  512

if each "FFS node" allocation takes 512 bytes, then you can have at
most 102400K/512 (i.e. 204800) nodes allocated before the kernel
malloc refuses to allocate any more nodes.

now consider when vnodes are allocated and freed. looking at file
sys/kern/vfs_subr.c, the function getnewvnode() allocates new vnodes.
it attempts to recycle a free vnode that has already been allocated
before it allocates a new one. free vnodes are stored on the global
list vnode_free_list and counted by the global "freevnodes". you can
see the current value of freevnodes using the command
"sysctl -a | grep freevnodes"... if there are not enough free vnodes,
then the system allocates a new vnode using zalloc(vnode_zone).

vnodes are freed by the vrele() function in the same file, but only
if their v_usecount is going to drop to zero and VSHOULDFREE() is
true. looking at sys/sys/vnode.h, VSHOULDFREE is true if neither the
VFREE nor the VDOOMED flag is set, the hold count is zero, the use
count is zero, and the vm_object's reference count and resident page
count are both zero (if there is a vm_object allocated).

Problem:

 - vrele() does not free vnodes that have resident pages of memory
   associated with them (VSHOULDFREE will be false).

 - there is a global limit on the number of allocated vnodes on the
   system. this is in the global int "desiredvnodes" which shows up
   as "kern.maxvnodes" in "sysctl -a" output.

   ** FreeBSD kernel basically ignores kern.maxvnodes **

 - the kernel will free an inactive vnode if all the pages of memory
   in its vm_object become non-resident (e.g. paged out).

 -> if you have a system with a large amount of RAM (e.g. >800MB) it
    is possible for any user to create enough vnodes to fill the
    "FFS node" kernel malloc area and deadlock the system.
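
the arithmetic behind these numbers can be written out as a small
sketch. the 102400K limit and the 512 byte node size are the figures
quoted above; the 4096 byte page size is an assumption (typical for
i386), and reading the ">800MB" figure as "one resident page per FFS
node" is an interpretation, not something the kernel source is quoted
as saying. the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* figures quoted in the post */
#define FFS_NODE_LIMIT_BYTES (102400ULL * 1024ULL)  /* "Limit" column: 102400K */
#define FFS_NODE_BYTES       512ULL                 /* "Size(s)" column */

/* ASSUMPTION: 4 KB pages; not stated in the post */
#define PAGE_BYTES           4096ULL

/* how many fixed-size nodes fit under a per-type malloc limit */
uint64_t max_ffs_nodes(uint64_t limit_bytes, uint64_t node_bytes)
{
    assert(node_bytes > 0);
    return limit_bytes / node_bytes;
}

/* RAM needed to keep at least one page resident for each of `nodes` vnodes */
uint64_t min_resident_bytes(uint64_t nodes, uint64_t page_bytes)
{
    return nodes * page_bytes;
}
```

under these assumptions max_ffs_nodes(FFS_NODE_LIMIT_BYTES,
FFS_NODE_BYTES) gives 204800 nodes, and min_resident_bytes(204800,
PAGE_BYTES) gives 800 MB, which would explain why only machines with
more RAM than that can keep enough one-page vnodes resident to hit the
limit.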
the key is to create a large number of inactive vnodes each of which
has a very small number of pages associated with it (ideal: 1 page
per vnode). [hint: think about "cvs co -AP ports"]

here is an example program that creates a specified number of small
files:

/*
 * try.c   chuck@research.att.com
 * will deadlock FreeBSD kernel if system has lots of RAM and mx is large
 * enough (e.g. >204800).  tested on system with 2GB RAM.
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int psz = getpagesize();
    char *buf;
    char fn[512];
    int fd, lcv, mx;

    if (argc != 2)
        errx(1, "usage: try number-of-files");
    mx = atoi(argv[1]);
    if (mx < 1)
        errx(1, "usage: try number-of-files");

    /* one page of data per file, so each vnode holds exactly one page */
    buf = malloc(psz);
    if (!buf)
        err(1, "malloc");
    bzero(buf, psz);
    strcpy(buf, "hi there!\n");

    /* create directories for a lot of files (1000 files per directory) */
    for (lcv = 0 ; lcv < mx ; lcv += 1000) {
        sprintf(fn, "%d.d", lcv/1000);
        if (mkdir(fn, 0777) < 0)
            err(1, "mkdir");
    }

    /* create a lot of 1 page files */
    for (lcv = 0 ; lcv < mx ; lcv++) {
        if ((lcv % 1000) == 0) {
            printf("%dk ", lcv / 1000);
            fflush(stdout);
        }
        sprintf(fn, "%d.d/%d.dat", lcv/1000, lcv%1000);
        fd = open(fn, O_CREAT|O_RDWR, 0666);
        if (fd < 0)
            err(1, "open");
        if (write(fd, buf, psz) != psz)
            err(1, "write");
        close(fd);
    }
    return 0;
}

now here is a script of me using it. note that before i run it,
debug.numvnodes is less than kern.maxvnodes.
when it finishes, note that debug.numvnodes is much greater than
kern.maxvnodes and the kernel memory allocation for "FFS node"
reported by "vmstat -m" is quite large (i didn't run it all the way
to deadlock):

Script started on Sat Sep 22 00:10:05 2001
cdn3> sysctl -a | egrep 'maxvnodes|numvnodes|freevnodes'
kern.maxvnodes: 129183
debug.numvnodes: 340
debug.wantfreevnodes: 25
debug.freevnodes: 24
cdn3> ./try 150000
0k 1k 2k 3k 4k 5k 6k 7k 8k 9k 10k 11k 12k 13k 14k 15k 16k 17k 18k 19k
20k 21k 22k 23k 24k 25k 26k 27k 28k 29k 30k 31k 32k 33k 34k 35k 36k
37k 38k 39k 40k 41k 42k 43k 44k 45k 46k 47k 48k 49k 50k 51k 52k 53k
54k 55k 56k 57k 58k 59k 60k 61k 62k 63k 64k 65k 66k 67k 68k 69k 70k
71k 72k 73k 74k 75k 76k 77k 78k 79k 80k 81k 82k 83k 84k 85k 86k 87k
88k 89k 90k 91k 92k 93k 94k 95k 96k 97k 98k 99k 100k 101k 102k 103k
104k 105k 106k 107k 108k 109k 110k 111k 112k 113k 114k 115k 116k 117k
118k 119k 120k 121k 122k 123k 124k 125k 126k 127k 128k 129k 130k 131k
132k 133k 134k 135k 136k 137k 138k 139k 140k 141k 142k 143k 144k 145k
146k 147k 148k 149k
cdn3> sysctl -a | egrep 'maxvnodes|numvnodes|freevnodes'
kern.maxvnodes: 129183
debug.numvnodes: 150334
debug.wantfreevnodes: 25
debug.freevnodes: 25
cdn3> vmstat -m | grep 'FFS no'
  512  ATA generic, UFS mount, FFS node, ifaddr, mount, BIO buffer, USBdev,
    FFS node150331 75166K  75166K102400K   151811    0     0  512
cdn3>
cdn3>

now, if there are enough other things trying to get RAM on the
system, then they will cause the RAM for the small files we've
created to be paged out and reallocated. if all the pages are removed
from a vnode, then it gets freed. thus, an active system is less
likely to deadlock because other users will be pushing these vnodes
out of RAM.
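
the freeing rule described above (an inactive vnode only becomes
freeable once its last resident page is gone) can be modelled in user
space. this is a simplified sketch based on the prose description of
VSHOULDFREE, not the kernel macro itself: the struct layouts, the
M_VFREE/M_VDOOMED flag values and the function name are all
hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* HYPOTHETICAL flag values, standing in for the kernel's VFREE/VDOOMED */
#define M_VFREE   0x1u
#define M_VDOOMED 0x2u

/* simplified stand-in for a vm_object */
struct model_object {
    int ref_count;
    int resident_page_count;
};

/* simplified stand-in for a vnode */
struct model_vnode {
    unsigned flags;
    int holdcnt;
    int usecount;
    const struct model_object *object;  /* NULL if no vm_object allocated */
};

/* true when the vnode may go onto the free list, per the description */
bool model_should_free(const struct model_vnode *vp)
{
    assert(vp != NULL);
    if (vp->flags & (M_VFREE | M_VDOOMED))
        return false;           /* already free, or being torn down */
    if (vp->holdcnt != 0 || vp->usecount != 0)
        return false;           /* still held or in use */
    if (vp->object == NULL)
        return true;            /* no vm_object, nothing keeps it alive */
    return vp->object->ref_count == 0 &&
           vp->object->resident_page_count == 0;
}
```

in this model a closed one-page file whose page is still cached has
resident_page_count == 1, so model_should_free() returns false; once
memory pressure pages it out the count drops to 0 and the same vnode
becomes freeable.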
here is a simple program that allocates 1GB of RAM:

/*
 * mzero.c   chuck@research.att.com
 */
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>

#define GB (1*1024*1024*1024)

int
main(void)
{
    void *p;
    int c;

    p = malloc(GB);
    if (!p)
        err(1, "malloc");
    printf("malloc done, bzeroing....\n");
    bzero(p, GB);       /* touch every page so it becomes resident */
    printf("bzero done\n");
    printf("hit ... ");
    c = getchar();      /* hold the memory until input arrives */
    exit(0);
}

watch what happens to debug.freevnodes when i run this program:

cdn3> ./mzero &
[1] 298
cdn3> malloc done, bzeroing....
bzero done
hit ...
[1]  + Suspended (tty input)         ./mzero
cdn3> sysctl -a | egrep 'maxvnodes|numvnodes|freevnodes'
kern.maxvnodes: 129183
debug.numvnodes: 150336
debug.wantfreevnodes: 25
debug.freevnodes: 43800
cdn3> ./mzero &
[2] 301
cdn3> malloc done, bzeroing....
bzero done
hit ...
[2]  + Suspended (tty input)         ./mzero
cdn3> sysctl -a | egrep 'maxvnodes|numvnodes|freevnodes'
kern.maxvnodes: 129183
debug.numvnodes: 150336
debug.wantfreevnodes: 25
debug.freevnodes: 146820
cdn3> vmstat -m | grep 'FFS no'
  512  ATA generic, UFS mount, FFS node, ifaddr, mount, BIO buffer, USBdev,
    FFS node150331 75166K  75166K102400K   151811    0     0  512
cdn3>

the vmstat shows the final memory usage in this test. most of the
vnodes behind that 75166K are now on the free list (note
debug.freevnodes=146820).

Fix:

you could hack around the problem by increasing the size of the FFS
node malloc area. however, i believe it is wrong for the FreeBSD
kernel to ignore the value of kern.maxvnodes. the kernel needs to be
smarter about how it recycles vnodes when it reaches the
kern.maxvnodes limit. specifically, it should reclaim some of the
inactive vnodes that have pages of memory associated with them.

chuck