From owner-freebsd-current@freebsd.org Mon Apr 23 07:48:14 2018
From: Julian Elischer <julian@freebsd.org>
To: "Rodney W. Grimes", Rick Macklem
Cc: Konstantin Belousov, "freebsd-current@freebsd.org", George Mitchell, Peter
Subject: Re: SCHED_ULE makes 256Mbyte i386 unusable
Date: Mon, 23 Apr 2018 15:47:37 +0800
Message-ID: <6f5fbe1e-6da3-c4ed-ddc3-1629ad2d3058@freebsd.org>
In-Reply-To: <201804221436.w3MEa9DY080702@pdx.rh.CN85.dnsmgr.net>

On 22/4/18 10:36 pm, Rodney W. Grimes wrote:
>> Konstantin Belousov wrote:
>>> On Sat, Apr 21, 2018 at 11:30:55PM +0000, Rick Macklem wrote:
>>>> Konstantin Belousov wrote:
>>>>> On Sat, Apr 21, 2018 at 07:21:58PM +0000, Rick Macklem wrote:
>>>>>> I decided to start a new thread on current related to SCHED_ULE, since I see
>>>>>> more than just performance degradation, and on a recent current kernel.
>>>>>> (I cc'd a couple of the people discussing performance problems in freebsd-stable
>>>>>> recently under a subject line of "Re: kern.sched.quantum: Creepy, sadistic scheduler".)
>>>>>>
>>>>>> When testing a pNFS server on a single-core i386 with 256Mbytes using a Dec. 2017
>>>>>> current/head kernel, I would see about a 30% performance degradation (elapsed
>>>>>> run time for a kernel build over NFSv4.1) when the server kernel was built with
>>>>>>     options SCHED_ULE
>>>>>> instead of
>>>>>>     options SCHED_4BSD
>> So, now that I have decreased the number of nfsd kernel threads to 32, it works
>> with both schedulers and with essentially the same performance. (ie.
The 30%
>> performance degradation has disappeared.)
>>
>>>>>> Now, with a kernel from a couple of days ago, the
>>>>>>     options SCHED_ULE
>>>>>> kernel becomes unusable shortly after starting testing.
>>>>>> I have seen two variants of this:
>>>>>> - Became essentially hung. All I could do was ping the machine from the network.
>>>>>> - Reported "vm_thread_new: kstack allocation failed"
>>>>>> and then any attempt to do anything gets "No more processes".
>>>>> This is strange. It usually means that you get KVA either exhausted or
>>>>> severely fragmented.
>>>> Yes. I reduced the number of nfsd threads from 256->32 and the SCHED_ULE
>>>> kernel is working ok now. I haven't done enough to compare performance yet.
>>>> Maybe I'll post again when I have some numbers.
>>>>
>>>>> Enter ddb, it should be operational since pings are replied. Try to see
>>>>> where the threads are stuck.
>>>> I didn't do this, since reducing the number of kernel threads seems to have fixed
>>>> the problem. For the pNFS server, the nfsd threads will spawn additional kernel
>>>> threads to do proxies to the mirrored DS servers.
>>>>
>>>>>> with the only difference being a kernel built with
>>>>>>     options SCHED_4BSD
>>>>>> everything works and performs the same as the Dec 2017 kernel.
>>>>>>
>>>>>> I can try rolling back through the revisions, but it would be nice if someone
>>>>>> could suggest where to start, because it takes a couple of hours to build a
>>>>>> kernel on this system.
>>>>>>
>>>>>> So, something has made things worse for a head/current kernel this winter, rick
>>>>> There are at least two potentially relevant changes.
>>>>>
>>>>> First is r326758 Dec 11 which bumped KSTACK_PAGES on i386 to 4.
>>>> I've been running this machine with KSTACK_PAGES=4 for some time, so no change.
>> W.r.t.
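[For reference, a sketch of how the nfsd thread count Rick mentions can be capped on a stock FreeBSD server; the value 32 matches the count he settled on, and `nfs_server_flags` is the standard rc.conf knob passed through to nfsd(8):]

```sh
# /etc/rc.conf -- cap the NFS server at 32 kernel threads
# (-t/-u serve TCP/UDP; -n sets the number of nfsd server threads)
nfs_server_enable="YES"
nfs_server_flags="-u -t -n 32"
```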
Rodney Grimes' comments
>> about this (which didn't end up in the messages in this thread):
>> I didn't see any instability when using KSTACK_PAGES=4 for this until this cropped
>> up and seemed to be scheduler related (but not really, it seems).
>> I bumped it to KSTACK_PAGES=4 because I needed that for the pNFS Metadata
>> Server code.
>>
>> Yes, NFS does use quite a bit of kernel stack. Unfortunately, it isn't one big
>> item getting allocated on the stack, but many moderate-sized ones.
>> (A part of it is multiple instances of "struct vattr", some buried in "struct nfsvattr",
>> that NFS needs to use. I don't think these are large enough to justify malloc/free,
>> but it has to use several of them.)
>>
>> One case I did try fixing was about 6 cases where "struct nfsstate" ended up on
>> the stack. I changed the code to malloc/free them and then, when testing, to
>> my surprise I had a 20% performance hit and shelved the patch.
>> Now that I know that the server was running near its limit, I might try this one
>> again, to see whether the performance hit still occurs when the machine has adequate
>> memory. If the performance hit goes away, I could commit this, but it wouldn't
>> have that much effect on the kstack usage. (It's interesting how this patch ended
>> up related to the issue this thread discusses.)
> Anything we can do to help relieve KSTACK usage, especially on i386,
> is helpful. There is a thread back quite some time ago where someone
> came up with a compile-time static "this function uses X bytes of
> local stack" check, and a bit of cleanup was done. We should pursue
> this issue further.

That was me. Use -Wframe-larger-than= and set it to something like
512 bytes (obviously you have to make warnings non-fatal as well).
> My experience with the i386/KSTACK issues was attempting to do installs
> from snapshot .iso's. I usually had to change to a custom kernel without
> INVARIANTS and WITNESS, or reduce KSTACK to 2 and suffer the small-stack
> problem (ie, don't use NFS during install). Neither was very pleasant.
>
> I have found it impractical to run the 4-page KSTACK in production
> VMs using i386 due to the memory requirements. I run many very lean
> i386 VMs with 64MB of memory. I suspect our user base also has
> many people doing this, and it would be to our advantage to try
> and reduce our kernel stack needs.
>
>
>>>>> Second is r332489 Apr 13, which introduced the 4/4G KVA/UVA split.
>>>> Could this change have resulted in the system being able to allocate fewer
>>>> kernel threads/stacks for some reason?
>>> Well, it could, as anything can be buggy. But the intent of the change
>>> was to give 4G KVA, and it did.
>> Righto. No concern here. I suspect the Dec. 2017 kernel was close to the limit
>> (see the performance issue that went away, noted above) and any change could
>> have pushed it across the line, I think.
>>
>>>>> Consequences of the first one are obvious: it is much harder to find
>>>>> the place to map the stack. The second change, on the other hand, provides
>>>>> almost the full 4G for KVA and should have mostly compensated for the negative
>>>>> effects of the first.
>>>>>
>>>>> And, I cannot see how changing the scheduler would fix or even affect that
>>>>> behaviour.
>>>> My hunch is that the system was running near its limit for kernel threads/stacks.
>>>> Then, somehow, the timing caused by SCHED_ULE resulted in the nfsd trying to reach
>>>> a higher peak number of threads and hitting the limit.
>>>> SCHED_4BSD happened to result in timing such that it stayed just below the
>>>> limit and worked.
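[As context for the knobs being discussed: KSTACK_PAGES, INVARIANTS, and WITNESS are all kernel config options, so a lean i386 VM kernel along the lines Rodney describes might look like the following sketch. The config name is made up; the option syntax is the standard config(5) form.]

```
# sys/i386/conf/LEANVM -- hypothetical lean i386 kernel for small VMs
include		GENERIC
ident		LEANVM
nooptions	INVARIANTS		# drop debug options to fit in 64MB VMs
nooptions	WITNESS
options		KSTACK_PAGES=2		# i386 default was bumped to 4 by r326758
```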
>>>> I can think of a couple of things that might affect this:
>>>> 1 - If SCHED_ULE doesn't do the termination of kernel threads as quickly, then
>>>> they wouldn't terminate and release their resources before more new ones
>>>> are spawned.
>>> The scheduler has nothing to do with thread termination. It might
>>> select running threads in a way that causes the undesired pattern to
>>> appear, which might create some amount of backlog for termination, but
>>> I doubt it.
>>>
>>>> 2 - If SCHED_ULE handles the nfsd threads in a more "bursty" way, then the burst
>>>> could try and spawn more mirror DS worker threads at about the same time.
>>>>
>>>> Anyhow, thanks for the help, rick
>> Have a good day, rick
>> _______________________________________________
>> freebsd-current@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-current
>> To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"