From: Ryan Libby <rlibby@freebsd.org>
Date: Tue, 30 Jun 2020 12:29:41 -0700
To: Rick Macklem
Cc: Konstantin Belousov, Jeff Roberson, freebsd-current@freebsd.org
Subject: Re: r358252 causes intermittent hangs where processes are stuck sleeping on btalloc

On Sun, Jun 28, 2020 at 9:57 PM Rick Macklem wrote:
>
> Just in case you were waiting for another email, I have now run several
> cycles of the kernel build over NFS on a recent head kernel with the
> one-line change and it has not hung.
>
> I don't know if this is the correct fix, but it would be nice to get
> something into head to fix this.
>
> If I don't hear anything in the next few days, I'll put it in a PR so it
> doesn't get forgotten.
>
> rick

Thanks for the follow-through on this.

I think the patch is not complete.  It looks like the problem is that for
systems that do not have UMA_MD_SMALL_ALLOC, we do
    uma_zone_set_allocf(vmem_bt_zone, vmem_bt_alloc);
but we haven't set an appropriate free function.  This is probably why
UMA_ZONE_NOFREE was originally there.  Removing NOFREE was fine for systems
with uma_small_alloc, but on systems without it we now get page_free as our
free function by default.  That calls kmem_free, which calls vmem_free ...
but we do our allocs with vmem_xalloc.  I'm not positive, but I think the
problem is that in effect we vmem_xalloc -> vmem_free, not vmem_xfree.

Three possible fixes:
 1: The one you tested, but this is not best for systems with
    uma_small_alloc.
 2: Pass UMA_ZONE_NOFREE conditional on UMA_MD_SMALL_ALLOC.
 3: Actually provide an appropriate vmem_bt_free function.

I think we should just do option 2 with a comment; it's simple and it's
what we used to do (a rough sketch is below).
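A minimal sketch of option 2, assuming the btag zone is still set up in
vmem_startup() in sys/kern/subr_vmem.c roughly the way it is today; the
"btflags" local is only for illustration, not an existing name, and this
is untested:

    uint32_t btflags;

    btflags = UMA_ZONE_VM;
#ifndef UMA_MD_SMALL_ALLOC
    /*
     * We override the zone's alloc function with vmem_bt_alloc() below
     * but have no matching free routine, so the boundary-tag slabs must
     * never reach the default page_free()/kmem_free() path, which would
     * vmem_free() memory that was obtained with vmem_xalloc().  NOFREE
     * sidesteps that, as it did before r358097.
     */
    btflags |= UMA_ZONE_NOFREE;
#endif
    vmem_bt_zone = uma_zcreate("vmem btag",
        sizeof(struct vmem_btag), NULL, NULL, NULL, NULL,
        UMA_ALIGN_PTR, btflags);
#ifndef UMA_MD_SMALL_ALLOC
    uma_zone_set_allocf(vmem_bt_zone, vmem_bt_alloc);
#endif

The point is only that NOFREE is needed exactly where vmem_bt_alloc() is
installed; platforms with uma_small_alloc keep a matching alloc/free pair
and can still free btag slabs.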
I'm not sure how much benefit we would see from option 3, and it's more
work.

Ryan

> ________________________________________
> From: owner-freebsd-current@freebsd.org on behalf of Rick Macklem
> Sent: Thursday, June 18, 2020 11:42 PM
> To: Ryan Libby
> Cc: Konstantin Belousov; Jeff Roberson; freebsd-current@freebsd.org
> Subject: Re: r358252 causes intermittent hangs where processes are stuck sleeping on btalloc
>
> Ryan Libby wrote:
> >On Mon, Jun 15, 2020 at 5:06 PM Rick Macklem wrote:
> >>
> >> Rick Macklem wrote:
> >> >r358098 will hang fairly easily, in 1-3 cycles of the kernel build over NFS.
> >> >I thought this was the culprit, since I did 6 cycles of r358097 without a hang.
> >> >However, I just got a hang with r358097, but it looks rather different.
> >> >The r358097 hang did not have any processes sleeping on btalloc.  They
> >> >appeared to be waiting on two different locks in the buffer cache.
> >> >As such, I think it might be a different problem.  (I'll admit I should have
> >> >made notes about this one before rebooting, but I was frustrated that
> >> >it happened and rebooted before looking at it in much detail.)
> >> Ok, so I did 10 cycles of the kernel build over NFS for r358096 and never
> >> got a hang.
> >> --> It seems that r358097 is the culprit and r358098 makes it easier
> >>     to reproduce.
> >> --> Basically it runs out of kernel memory.
> >>
> >> It is not obvious if I can revert these two commits without reverting
> >> other ones, since there were a bunch of vm changes after these.
> >>
> >> I'll take a look, but if you guys have any ideas on how to fix this, please
> >> let me know.
> >>
> >> Thanks, rick
> >
> >Interesting.  Could you try re-adding UMA_ZONE_NOFREE to the vmem btag
> >zone to see if that rescues it, on whatever base revision gets you a
> >reliable repro?
> Good catch!  That seems to fix it.  I've done 8 cycles of kernel build over
> NFS without a hang (normally I'd get one in the first 1-3 cycles).
>
> I don't know if the intent was to delete UMA_ZONE_VM and r358097
> had a typo in it that deleted UMA_ZONE_NOFREE, or ???
>
> Anyhow, I just put it back to UMA_ZONE_VM | UMA_ZONE_NOFREE and
> the hangs seem to have gone away.
>
> The small patch I did is attached, in case that isn't what you meant.
>
> I'll run a few more cycles just in case, but I think this fixes it.
>
> Thanks, rick
>
> >
> > Jeff, to fill you in, I have been getting intermittent hangs on a Pentium 4
> > (single core, i386) with 1.25Gbytes RAM when doing kernel builds using
> > head kernels from this winter.  (I also saw one when doing a kernel build
> > on UFS, so they aren't NFS-specific, although easier to reproduce that way.)
> > After a typical hang, there will be a bunch of processes sleeping on "btalloc"
> > and several processes holding the following lock:
> >     exclusive sx lock @ vm/vm_map.c:4761
> > - I have seen hangs where that is the only lock held by any process except
> >   the interrupt thread.
> > - I have also seen processes waiting on the following locks:
> >     kern/subr_vmem.c:1343
> >     kern/subr_vmem.c:633
> >
> > I can't be absolutely sure r358098 is the culprit, but it seems to make the
> > problem more reproducible.
> >
> > If anyone has a patch suggestion, I can test it.
> > Otherwise, I will continue to test r358097 and earlier, to try and see what
> > hangs occur.  (I've done 8 cycles of testing of r356776 without difficulties,
> > but that doesn't guarantee it isn't broken.)
> >
> > There is a bunch more of the stuff I got for Kostik and Ryan below.
> > I can do "db" when it is hung, but it is a screen console, so I need to
> > transcribe the output to email by hand.  (i.e. if you need something
> > specific I can do that, but trying to do everything Kostik and Ryan asked
> > for isn't easy.)
> >
> > rick
> >
> > Konstantin Belousov wrote:
> > >On Fri, May 22, 2020 at 11:46:26PM +0000, Rick Macklem wrote:
> > >> Konstantin Belousov wrote:
> > >> >On Wed, May 20, 2020 at 11:58:50PM -0700, Ryan Libby wrote:
> > >> >> On Wed, May 20, 2020 at 6:04 PM Rick Macklem wrote:
> > >> >> >
> > >> >> > Hi,
> > >> >> >
> > >> >> > Since I hadn't upgraded a kernel through the winter, it took me a while
> > >> >> > to bisect this, but r358252 seems to be the culprit.
> > No longer true.  I succeeded in reproducing the hang today running a
> > r358251 kernel.
> >
> > I haven't had much luck so far, but see below for what I have learned.
> >
> > >> >> >
> > >> >> > If I do a kernel build over NFS using my not so big Pentium 4 (single core,
> > >> >> > 1.25Gbytes RAM, i386), about every second attempt will hang.
> > >> >> > When I do a "ps" in the debugger, I see processes sleeping on btalloc.
> > >> >> > If I revert to r358251, I cannot reproduce this.
> > As above, this is no longer true.
> >
> > >> >> >
> > >> >> > Any ideas?
> > >> >> >
> > >> >> > I can easily test any change you might suggest to see if it fixes the
> > >> >> > problem.
> > >> >> >
> > >> >> > If you want more debug info, let me know, since I can easily
> > >> >> > reproduce it.
> > >> >> >
> > >> >> > Thanks, rick
> > >> >>
> > >> >> Nothing obvious to me.  I can maybe try a repro on a VM...
> > >> >>
> > >> >> ddb ps, acttrace, alltrace, show all vmem, show page would be welcome.
> > >> >>
> > >> >> "btalloc" is "We're either out of address space or lost a fill race."
> > From what I see, I think it is "out of address space".
> > For one of the hangs, when I did "show alllocks", everything except the
> > intr thread was waiting for the
> >     exclusive sx lock @ vm/vm_map.c:4761
> >
> > >> >
> > >> >Yes, I would not be surprised to be out of something on a 1G i386 machine.
> > >> >Please also add 'show alllocks'.
> > >> Ok, I used an up to date head kernel and it took longer to reproduce a hang.
> > Go down to Kostik's comment about kern.maxvnodes for the rest of what I've
> > learned.  (The time it takes to reproduce one of these varies greatly, but
> > I usually get one within 3 cycles of a full kernel build over NFS.  I have
> > had it happen once when doing a kernel build over UFS.)
> >
> > >> This time, none of the processes are stuck on "btalloc".
> > > I'll try and give you most of the above, but since I have to type it in
> > > by hand from the screen, I might not get it all.  (I'm no real typist ;-)
> > >
> > > > show alllocks
> > > exclusive lockmgr ufs (ufs) r = 0 locked @ kern/vfs_subr.c:3259
> > > exclusive lockmgr nfs (nfs) r = 0 locked @ kern/vfs_lookup.c:737
> > > exclusive sleep mutex kernel arena domain (kernel arena domain) r = 0 locked @ kern/subr_vmem.c:1343
> > > exclusive lockmgr bufwait (bufwait) r = 0 locked @ kern/vfs_bio.c:1663
> > > exclusive lockmgr ufs (ufs) r = 0 locked @ kern/vfs_subr.c:2930
> > > exclusive lockmgr syncer (syncer) r = 0 locked @ kern/vfs_subr.c:2474
> > > Process 12 (intr) thread 0x.. (1000008)
> > > exclusive sleep mutex Giant (Giant) r = 0 locked @ kern/kern_intr.c:1152
> > >
> > > > ps
> > > - Not going to list them all, but here are the ones that seem interesting...
> > >   18      0 0 0  DL   vlruwt  0x11d939cc  [vnlru]
> > >   16      0 0 0  DL   (threaded)          [bufdaemon]
> > >   100069         D    qsleep              [bufdaemon]
> > >   100074         D    -                   [bufspacedaemon-0]
> > >   100084         D    sdflush 0x11923284  [/ worker]
> > >   - and more of these for the other UFS file systems
> > >   9       0 0 0  DL   psleep  0x1e2f830   [vmdaemon]
> > >   8       0 0 0  DL   (threaded)          [pagedaemon]
> > >   100067         D    psleep  0x1e2e95c   [dom0]
> > >   100072         D    launds  0x1e2e968   [laundry: dom0]
> > >   100073         D    umarcl  0x12cc720   [uma]
> > >   ... a bunch of usb and cam ones
> > >   100025         D    -       0x1b2ee40   [doneq0]
> > >   ...
> > >   12      0 0 0  RL   (threaded)          [intr]
> > >   100007         I                        [swi6: task queue]
> > >   100008         Run  CPU 0               [swi6: Giant taskq]
> > >   ...
> > >   100000         D    swapin  0x1d96dfc   [swapper]
> > >   - and a bunch more in D state.
> > > Does this mean the swapper was trying to swap in?
> > >
> > > > acttrace
> > > - just shows the keyboard
> > > kdb_enter() at kdb_enter+0x35/frame
> > > vt_kbdevent() at vt_kbdevent+0x329/frame
> > > kbdmux_intr() at kbdmux_intr+0x19/frame
> > > taskqueue_run_locked() at taskqueue_run_locked+0x175/frame
> > > taskqueue_run() at taskqueue_run+0x44/frame
> > > taskqueue_swi_giant_run(0) at taskqueue_swi_giant_run+0xe/frame
> > > ithread_loop() at ithread_loop+0x237/frame
> > > fork_exit() at fork_exit+0x6c/frame
> > > fork_trampoline() at 0x../frame
> > >
> > > > show all vmem
> > > vmem 0x.. 'transient arena'
> > >   quantum:   4096
> > >   size:      23592960
> > >   inuse:     0
> > >   free:      23592960
> > >   busy tags: 0
> > >   free tags: 2
> > >              inuse  size      free  size
> > >   16777216   0      0         1     23592960
> > > vmem 0x.. 'buffer arena'
> > >   quantum:   4096
> > >   size:      94683136
> > >   inuse:     94502912
> > >   free:      180224
> > >   busy tags: 1463
> > >   free tags: 3
> > >              inuse  size      free  size
> > >   16384      2      32768     1     16384
> > >   32768      39     1277952   1     32768
> > >   65536      1422   93192192  0     0
> > >   131072     0      0         1     131072
> > > vmem 0x.. 'i386trampoline'
> > >   quantum:   1
> > >   size:      24576
> > >   inuse:     20860
> > >   free:      3716
> > >   busy tags: 9
> > >   free tags: 3
> > >              inuse  size      free  size
> > >   32         1      48        1     52
> > >   64         2      208       0     0
> > >   128        2      280       0     0
> > >   2048       1      2048      1     3664
> > >   4096       2      8192      0     0
> > >   8192       1      10084     0     0
> > > vmem 0x.. 'kernel rwx arena'
> > >   quantum:   4096
> > >   size:      0
> > >   inuse:     0
> > >   free:      0
> > >   busy tags: 0
> > >   free tags: 0
> > > vmem 0x.. 'kernel arena dom'
> > >   quantum:   4096
> > >   size:      56623104
> > >   inuse:     56582144
> > >>  free:      40960
> > >>  busy tags: 11224
> > >>  free tags: 3
> > >I think this is the trouble.
> > >
> > >Did you try reducing kern.maxvnodes?  What is the default value for
> > >the knob on your machine?
> > The default is 84342.
> > I have tried 64K, 32K and 128K and they all hung sooner or later.
> > For the 32K case, I did see vnodes being recycled for a while before it
> > got hung, so it isn't just when it hits the limit.
> >
> > Although it is much easier for me to reproduce on an NFS mount, I did see
> > a hang while doing a kernel build on UFS (no NFS mount on the machine at
> > that time).
> >
> > So, I now know that the problem pre-dates r358252 and is not NFS specific.
> >
> > I'm not bisecting back further to try and isolate the commit that causes
> > this.  (Unfortunately, each test cycle can take days.  I now know that I
> > have to do several of these kernel builds, which take hours each, to see
> > if a hang is going to happen.)
> >
> > I'll post if/when I have more, rick
> >
> > We scaled maxvnodes for ZFS and UFS; NFS might be even more demanding,
> > having a larger node size.
> >
> > >              inuse  size      free  size
> > >   4096       11091  45428736  0     0
> > >   8192       63     516096    0     0
> > >   16384      12     196608    0     0
> > >   32768      6      196608    0     0
> > >   40960      2      81920     1     40960
> > >   65536      16     1048576   0     0
> > >   94208      1      94208     0     0
> > >   110592     1      110592    0     0
> > >   131072     15     2441216   0     0
> > >   262144     15     3997696   0     0
> > >   524288     1      524288    0     0
> > >   1048576    1      1945600   0     0
> > > vmem 0x.. 'kernel arena'
> > >   quantum:   4096
> > >   size:      390070272
> > >   inuse:     386613248
> > >   free:      3457024
> > >   busy tags: 873
> > >   free tags: 3
> > >              inuse  size      free  size
> > >   4096       35     143360    1     4096
> > >   8192       2      16384     2     16384
> > >   12288      1      12288     0     0
> > >   16384      30     491520    0     0
> > >   20480      140    2867200   0     0
> > >   65536      1      65536     0     0
> > >   131072     631    82706432  0     0
> > >   1048576    0      0         1     1339392
> > >   2097152    27     56623104  1     2097152
> > >   8388608    1      13774848  0     0
> > >   16777216   3      74883072  0     0
> > >   33554432   1      36753408  0     0
> > >   67108864   1      118276096 0     0
> > >
> > > > alltrace
> > > - I can't face typing too much more, but I'll put a few
> > >   here that look interesting
> > >
> > > - for csh
> > > sched_switch()
> > > mi_switch()
> > > kern_yield()
> > > getblkx()
> > > breadn_flags()
> > > ffs_update()
> > > ufs_inactive()
> > > VOP_INACTIVE()
> > > vinactivef()
> > > vput_final()
> > > vm_object_deallocate()
> > > vm_map_process_deferred()
> > > kern_munmap()
> > > sys_munmap()
> > >
> > > - for cc
> > > sched_switch()
> > > mi_switch()
> > > sleepq_switch()
> > > sleepq_timedwait()
> > > _sleep()
> > > pause_sbt()
> > > vmem_bt_alloc()
> > > keg_alloc_slab()
> > > zone_import()
> > > cache_alloc()
> > > cache_alloc_retry()
> > > uma_zalloc_arg()
> > > bt_fill()
> > > vmem_xalloc()
> > > vmem_alloc()
> > > kmem_alloc()
> > > kmem_malloc_domainset()
> > > page_alloc()
> > > keg_alloc_slab()
> > > zone_import()
> > > cache_alloc()
> > > cache_alloc_retry()
> > > uma_zalloc_arg()
> > > nfscl_nget()
> > > nfs_create()
> > > vop_sigdefer()
> > > nfs_vnodeops_bypass()
> > > VOP_CREATE_APV()
> > > vn_open_cred()
> > > vn_open()
> > > kern_openat()
> > > sys_openat()
> > >
> > > Then there are a bunch of these...
> > > for cc, make:
> > > sched_switch()
> > > mi_switch()
> > > sleepq_switch()
> > > sleepq_catch_signals()
> > > sleepq_wait_sig()
> > > kern_wait6()
> > > sys_wait4()
> > >
> > > - for vnlru
> > > sched_switch()
> > > mi_switch()
> > > sleepq_switch()
> > > sleepq_timedwait()
> > > _sleep()
> > > vnlru_proc()
> > > fork_exit()
> > > fork_trampoline()
> > >
> > > - for syncer
> > > sched_switch()
> > > mi_switch()
> > > critical_exit_preempt()
> > > intr_event_handle()
> > > intr_execute_handlers()
> > > lapic_handle_intr()
> > > Xapic_isr1()
> > > - interrupt
> > > memset()
> > > cache_alloc()
> > > cache_alloc_retry()
> > > uma_zalloc_arg()
> > > vmem_xalloc()
> > > vmem_bt_alloc()
> > > keg_alloc_slab()
> > > zone_import()
> > > cache_alloc()
> > > cache_alloc_retry()
> > > uma_zalloc_arg()
> > > bt_fill()
> > > vmem_xalloc()
> > > vmem_alloc()
> > > bufkva_alloc()
> > > getnewbuf()
> > > getblkx()
> > > breadn_flags()
> > > ffs_update()
> > > ffs_sync()
> > > sync_fsync()
> > > VOP_FSYNC_APV()
> > > sched_sync()
> > > fork_exit()
> > > fork_trampoline()
> > >
> > > - for bufdaemon (a bunch of these)
> > > sched_switch()
> > > mi_switch()
> > > sleepq_switch()
> > > sleepq_timedwait()
> > > _sleep()
> > > buf_daemon()
> > > fork_exit()
> > > fork_trampoline()
> > >
> > > vmdaemon and pagedaemon are basically just like the above,
> > > sleeping in
> > > vm_daemon()
> > > or
> > > vm_pageout_worker()
> > > or
> > > vm_pageout_laundry_worker()
> > > or
> > > uma_reclaim_worker()
> > >
> > > That's all the typing I can take right now.
> > > I can probably make this happen again if you want more specific stuff.
> > >
> > > rick
> >
> > _______________________________________________
> > freebsd-current@freebsd.org mailing list
> > https://lists.freebsd.org/mailman/listinfo/freebsd-current
> > To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"