Date: Tue, 29 Mar 2016 08:08:40 +0200
From: "O. Hartmann"
To: Don Lewis
Cc: imb@protected-networks.net, kmacy@freebsd.org, freebsd-current@freebsd.org
Subject: Re: CURRENT slow and shaky network stability
Message-ID: <20160329080840.3da929de@freyja.zeit4.iv.bundesimmobilien.de>
In-Reply-To: <201603282152.u2SLq9HN086958@gw.catspoiler.org>
References: <20160328084440.501ef862.ohartman@zedat.fu-berlin.de> <201603282152.u2SLq9HN086958@gw.catspoiler.org>
Organization: FU Berlin

On Mon, 28 Mar 2016 14:52:09 -0700 (PDT)
Don Lewis wrote:

> On 28 Mar, O. Hartmann wrote:
> > On Sat, 26 Mar 2016 14:26:45 -0700 (PDT)
> > Don Lewis wrote:
> >
> >> On 26 Mar, Michael Butler wrote:
> >> > -current is not great for interactive use at all. The strategy of
> >> > pre-emptively dropping idle processes to swap is hurting .. big time.
> >> >
> >> > Compare inactive memory to swap in this example ..
> >> >
> >> > 110 processes: 1 running, 108 sleeping, 1 zombie
> >> > CPU:  1.2% user,  0.0% nice,  4.3% system,  0.0% interrupt, 94.5% idle
> >> > Mem: 474M Active, 1609M Inact, 764M Wired, 281M Buf, 119M Free
> >> > Swap: 4096M Total, 917M Used, 3178M Free, 22% Inuse
> >> >
> >> >   PID USERNAME THR PRI NICE  SIZE    RES STATE  C   TIME  WCPU COMMAND
> >> >  1819 imb        1  28    0  213M 11284K select 1 147:44 5.97% gkrellm
> >> > 59238 imb       43  20    0  980M   424M select 0  10:07 1.92% firefox
> >> >
> >> > .. it shouldn't start randomly swapping out processes because they're
> >> > used infrequently when there's more than enough RAM to spare ..
> >>
> >> I don't know what changed, and probably something can use some tweaking,
> >> but paging out idle processes isn't always the wrong thing to do. For
> >> instance if I'm using poudriere to build a bunch of packages and its
> >> heavy use of tmpfs is pushing the machine into many GB of swap usage, I
> >> don't want interactive use like:
> >>   vi foo.c
> >>   cc foo.c
> >>   vi foo.c
> >> to suffer because vi and cc have to be read in from a busy hard drive
> >> each time while unused console getty and idle sshd processes in a bunch
> >> of jails are still hanging on to memory even though they haven't
> >> executed any instructions since shortly after the machine was booted
> >> weeks ago.
> >>
> >> > It also shows up when trying to reboot .. on all of my gear, 90 seconds
> >> > of "fail-safe" time-out is no longer enough when a good proportion of
> >> > daemons have been dropped onto swap and must be brought back in to flush
> >> > their data segments :-(
> >>
> >> That's a different and known problem. See:
> >>
> >
> > CURRENT has become unusable and faulty. Updating ports for poudriere ends
> > up in this error/broken pipe from remote console:
> >
> > [~] poudriere ports -u -p head
> > [00:00:00] ====>> Updating portstree "head"
> > [00:00:00] ====>> Updating the ports tree... done
> > root@gate [~] Fssh_packet_write_wait: Connection to 192.168.250.111 port
> > 22: Broken pipe
> >
> >
> > Although not under load, several processes over time get idled/paged out -
> > and they never recover, the connection is then sabotaged, the whole thing
> > unusable :-(
>
> I'm definitely not seeing that here.
> This is getting close to the end
> of a big poudriere run:
>
> last pid: 82549;  load averages: 20.05, 20.72, 23.51  up 5+12:34:14  12:51:55
> 144 processes: 20 running, 109 sleeping, 15 stopped
> CPU: 85.3% user,  0.0% nice, 14.7% system,  0.0% interrupt,  0.0% idle
> Mem: 1082M Active, 19G Inact, 9718M Wired, 249M Buf, 1095M Free
> ARC: 3841M Total, 2039M MFU, 642M MRU, 3395K Anon, 111M Header, 1044M Other
> Swap: 40G Total, 9691M Used, 31G Free, 23% Inuse, 196K In
>
> At the moment, openoffice-4, openoffice-devel, libreoffice, and chromium
> are all being built and are using tmpfs for "wrkdir data localbase", so
> there are many GB of data in tmpfs, which is the reason for the high
> inact and swap usage. I just hit the return key in an idle (for a
> couple of hours) terminal window containing an ssh login session to the
> same machine. I got a fresh command prompt essentially instantaneously.
> It couldn't have taken more than a couple hundred milliseconds to wake
> up and page in the idle sshd and shell processes on the build server.
>
> [a couple hours later, after poudriere is done and all tmpfs is gone]
>
> last pid: 66089;  load averages:  0.13,  1.59,  4.61  up 5+14:14:33  14:32:14
> 71 processes: 1 running, 55 sleeping, 15 stopped
> CPU:  3.1% user,  0.0% nice,  0.0% system,  0.0% interrupt, 96.9% idle
> Mem: 58M Active, 85M Inact, 12G Wired, 249M Buf, 19G Free
> ARC: 6249M Total, 2792M MFU, 2246M MRU, 16K Anon, 133M Header, 1078M Other
> Swap: 40G Total, 81M Used, 40G Free
>
> [after tracking down and exiting all of those stopped processes]
>
> last pid: 66103;  load averages:  0.20,  0.99,  3.80  up 5+14:17:18  14:34:59
> 56 processes: 1 running, 55 sleeping
> CPU:  0.0% user,  0.0% nice,  0.1% system,  0.1% interrupt, 99.9% idle
> Mem: 57M Active, 88M Inact, 12G Wired, 249M Buf, 19G Free
> ARC: 6251M Total, 2793M MFU, 2247M MRU, 16K Anon, 133M Header, 1078M Other
> Swap: 40G Total, 63M Used, 40G Free
>
> The biggest chunk of the 63 MB of swap appears to be nginx.
> Its process size is 29 MB, but it has zero resident. It hasn't executed any
> code since it was first started when I booted the system several days
> ago. Other consumers appear to be getty and sshd and syslogd in various
> untouched jails.
>
>
> I've seen reports that r296137 and r297267 show the ssh problem, but
> this machine is in the middle with r297204 and I don't see it.
>
> As mentioned previously, I'm not running Xorg and a bunch of bloated
> X11 clients on this machine. Those make fat targets for having RAM
> taken from them, which would probably make my interactive experience
> less pleasant, but that should still not affect ssh.
>
> On my FreeBSD 10 machine, which has only 8 GB of RAM, my experience is
> that firefox gets pretty bloated after a while. It's currently at 2.6
> GB (with 2.8 GB of swap currently in use - I've got some other RAM hogs
> running as well) and I'm not seeing any problems, but when it gets up in
> the 4-5 GB range, things can start to get pretty laggy, but I don't see
> problems with ssh. The biggest problem with firefox seems to be
> javascript, which seems to leak memory like a sieve. Making heavy use
> of the noscript plugin is the only way to keep Firefox usable.
>
> The only thing I can think of is that this is triggered by something in
> the machine configuration or the specific hardware. I'm running a
> GENERIC kernel and the only non-standard modification to /usr/src is the
> dummynet AQM patchset. The latter should have no effect since I'm not
> using ipfw on this machine.
>
> If I get a chance, I'll try booting my FreeBSD 11 machine with less RAM to
> see if that is a trigger.

Several of my boxes do not run X11 or "... a bunch of bloated X11 clients",
and they run with 8 GB, 16 GB or 32 GB of RAM (the last one does have X11).
On all remote systems with the most recent CURRENT (we are talking about
r297237 - r297369 right now) I definitely do not get a fresh prompt
"immediately". It takes up to 60 seconds (and more) to recover, even if the
box is completely idle. In a steadily growing number of cases I now get
broken pipes. This also happens in sessions running "poudriere options" on
larger installations, and this is completely unacceptable.
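
For anyone who wants to compare swap use against inactive and free memory
across several boxes, here is a minimal sketch that pulls the figures out of
top(1) "Mem:" and "Swap:" header lines like the ones quoted above. It is not
something that was used in this thread; the function names are made up for
illustration, and it assumes the K/M/G/T size suffixes seen in those outputs.

```python
import re

# Multipliers for the size suffixes seen in the top(1) headers, in megabytes.
# Assumption: only K, M, G and T suffixes occur.
_UNITS_MB = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}


def parse_top_line(line):
    """Parse a top(1) 'Mem:' or 'Swap:' header line into {label: megabytes}.

    Fields without a size suffix (for example '22% Inuse') are skipped.
    """
    _, _, rest = line.partition(":")
    fields = {}
    for match in re.finditer(r"(\d+(?:\.\d+)?)([KMGT])\s+(\w+)", rest):
        value, unit, label = match.groups()
        fields[label] = float(value) * _UNITS_MB[unit]
    return fields


def swap_vs_inactive(mem_line, swap_line):
    """Return (swap used, inactive, free) in megabytes for one top snapshot."""
    mem = parse_top_line(mem_line)
    swap = parse_top_line(swap_line)
    return swap.get("Used", 0.0), mem.get("Inact", 0.0), mem.get("Free", 0.0)


if __name__ == "__main__":
    # Header lines taken from the first top(1) snapshot quoted in this thread.
    used, inact, free = swap_vs_inactive(
        "Mem: 474M Active, 1609M Inact, 764M Wired, 281M Buf, 119M Free",
        "Swap: 4096M Total, 917M Used, 3178M Free, 22% Inuse",
    )
    print(f"swap used: {used:.0f} MB, inactive: {inact:.0f} MB, free: {free:.0f} MB")
```

Feeding it the header lines from an affected and an unaffected machine would at
least put the same numbers side by side instead of eyeballing top output.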