Date: Sat, 15 Oct 2016 18:18:48 +0200
From: Ulrich Spörlein
To: freebsd-current@freebsd.org
Subject: FreeBSD 11.x grinds to a halt after about 48h of uptime
Message-ID: <20161015161848.GD2532@acme.spoerlein.net>

Hey all,

while 11.x is -STABLE now, this has been happening to my machine ever
since I upgraded it to 11-CURRENT years ago. I have no idea when exactly
it started, but what always happens is this:

- System and X11 are up and running; I keep them running overnight as
  I'm too lazy to reboot and restart everything.
- There's a bunch of xterms, Chrome, Clementine-Player and some other
  programs running.
- Coming back to the machine the next day (or the day after), it will
  exit the screensaver just fine, and then either I can use it for a
  couple of seconds before it freezes, or it's pretty much dead already.
  The mouse cursor still moves for a bit, but then also freezes (so is
  this a GPU problem??)

Now what I currently see on the screen is a clock widget stuck at 18:04,
but conky itself last updated at 18:00:18 ...

This time I had some SSH sessions open from another machine to see some
more useful things. There was nothing in the various logs under /var/log
(I also can't run dmesg anymore ...)

I had top(1) running in a loop, this is the last output:

last pid: 25633;  load averages:  0.27,  0.39,  0.36    up 1+23:03:28  18:00:12
202 processes: 2 running, 188 sleeping, 11 zombie, 1 waiting
Mem: 8873M Active, 1783M Inact, 5072M Wired, 567M Buf, 132M Free
ARC: 1844M Total, 469M MFU, 268M MRU, 16K Anon, 96M Header, 1012M Other
Swap: 4096M Total, 2395M Used, 1701M Free, 58% Inuse

  PID USERNAME THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
   11 root       8 155 ki31     0K   128K CPU0    0 364.6H 772.95% idle
 3122 uqs       15  28    0  7113M  5861M uwait   0  94:44  13.96% chrome
 2887 uqs       28  22    0  1394M   237M select  2 172:53   6.98% chrome
 2890 uqs       11  21    0  1034M   178M select  5 231:21   1.95% chrome
 1062 root       9  21    0   440M 47220K select  0  67:09   0.98% Xorg
 3002 uqs       15  25    5  1159M   172M uwait   2  19:09   0.00% chrome
 3139 uqs       17  25    5  1163M   156M uwait   2  16:15   0.00% chrome
 3001 uqs       18  25    5  1639M   575M uwait   0  16:05   0.00% chrome
   12 root      24 -64    -     0K   384K WAIT   -1  10:53   0.00% intr
 3129 uqs       12  20    0  2820M  1746M uwait   6   8:36   0.00% chrome
 2822 uqs        9  20    0   217M 81300K select  0   5:10   0.00% conky
 3174 root       1  20    0 21532K  3188K select  0   4:20   0.00% systat
 3130 uqs       16  20    0  1058M   131M uwait   4   3:03   0.00% chrome
 2998 uqs       16  20    0  1110M   123M uwait   2   2:53   0.00% chrome
 3165 uqs       10  20    0  1209M   215M uwait   6   2:52   0.00% chrome
 3142 uqs       11  25    5  1344M   195M uwait   2   2:46   0.00% chrome
 2876 uqs       19  20    0   580M 37164K select  3   2:42   0.00% clementine-player
   20 root       2 -16    -     0K    32K psleep  6   2:25   0.00% pagedaemon

I also had systat -vm running and it continued to update its screen ...
for a short while. This is the last update before SSH died:

Mem usage: 0k%Phy 5%Kmem
Mem: KB    REAL           VIRTUAL                        VN PAGER  SWAP PAGER
         Tot    Share       Tot    Share     Free          in  out     in  out
Act   11051k    67868  71051992   255448    61840  count
All   11051k    67924  71058776   262100           pages
Proc:                                                             Interrupts
r p d s w Csw Trp Sys Int Sof Flt                         ioflt   224 total
25 730 11 724 109 404 101                              13 cow       2 ehci0 16
                                                          zfod      3 ehci1 23
0.0%Sys 0.1%Intr 0.0%User 0.0%Nice 99.9%Idle              ozfod    16 cpu0:timer
|    |    |    |    |    |    |    |    |    |            %ozfod      xhci0 264
                                                          daefr     3 em0 265
                                      50 dtbuf            prcfr    94 hdac1 266
Namei     Name-cache   Dir-cache  349167 desvn            totfr       ahci0 270
   Calls    hits   %    hits   %  349155 numvn            react     5 cpu1:timer
     121     121 100              253501 frevn            pdwak     1 cpu2:timer
                                                          pdpgs    29 cpu7:timer
Disks   md0  ada0  ada1 pass0 pass1 pass2                 intrn    12 cpu3:timer
KB/t   0.00  0.00  0.00  0.00  0.00  0.00         5318892 wire     41 cpu6:timer
tps       0     0     0     0     0     0         9261404 act      12 cpu5:timer
MB/s   0.00  0.00  0.00  0.00  0.00  0.00         1598184 inact     6 cpu4:timer
%busy     0     0     0     0     0     0                 cache       vgapci0
                                                    61840 free
                                                   712304 buf

Why do I have a Chrome tab using about 6G? What other sort of debugging
output can be helpful to get to the bottom of this? The machine still
responds to pings just fine, and TCP connections get set up, but the SSH
handshake never completes.

This always happens after 30-50h of uptime, is super annoying, and has
been going on for more than a year. Help?

Note: I cut the power to the monitor overnight to save electricity. Can
this mess up something in the Radeon card or the X server? What
combinations would be most useful to try next?

Uli
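
As a quick sanity check on the top(1) numbers above, the memory summary can be tallied with a few lines of Python. This is only a sketch, not something that was run on the affected machine: it assumes that Active + Inact + Wired + Free roughly accounts for usable RAM, and it leaves Buf out on the assumption that it overlaps with the other categories. The helper name `parse_summary` and the constant names are made up for illustration.

```python
import re

# Summary lines copied verbatim from the top(1) snapshot in this message.
MEM_LINE = "Mem: 8873M Active, 1783M Inact, 5072M Wired, 567M Buf, 132M Free"
SWAP_LINE = "Swap: 4096M Total, 2395M Used, 1701M Free, 58% Inuse"
CHROME_RES_MB = 5861  # RES of chrome PID 3122 in the same snapshot


def parse_summary(line):
    """Return {label: megabytes} for every '<n>M <Label>' pair in the line."""
    return {label: int(value) for value, label in re.findall(r"(\d+)M (\w+)", line)}


mem = parse_summary(MEM_LINE)
swap = parse_summary(SWAP_LINE)

# Assumption: these four categories approximate usable RAM; Buf is excluded
# because it is assumed to be counted inside one of them already.
ram_total = mem["Active"] + mem["Inact"] + mem["Wired"] + mem["Free"]
free_pct = 100.0 * mem["Free"] / ram_total
swap_pct = 100.0 * swap["Used"] / swap["Total"]
chrome_pct = 100.0 * CHROME_RES_MB / ram_total

print(f"accounted RAM: {ram_total}M")
print(f"free: {free_pct:.1f}%  swap used: {swap_pct:.1f}%")
print(f"largest chrome process: {chrome_pct:.1f}% of accounted RAM")
```

With the figures from the snapshot this comes to 15860M accounted for, under 1% free, swap a bit over 58% used, and the single largest chrome process holding roughly 37% of accounted RAM.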