From: Bob Proulx
Date: Sun, 18 Oct 2020 16:24:21 -0600
To: freebsd-questions@freebsd.org
Subject: Re: sh scripting question

John Levine wrote:
> In article <20201018144327254822114@bob.proulx.com> you write:
> >> Since find is in use, I think the canonical solution is
> >> to use "find -print0"..."xargs -0"
> >
> >Here is an example, I will use a "ls -ld" command just to make it a
> >real concrete example and perhaps easier to read that way.
> >
> >  find . -exec ls -ld {} +
>
> Sometimes that's better, sometimes not.  I have find scripts that delete
> stale files, and it is a lot faster to use xargs to run "rm" once for
> each thousand files than once per file.

Ah!  Not quite.  Let me explain.  With "{} +", find invokes the
command only *once*, not multiple times.

"{} \;" invokes the command once per argument.  But "{} +" invokes the
command with as many arguments as will fit within ARG_MAX (if ARG_MAX
is a limitation at all; that is kernel dependent).  On systems where
ARG_MAX is the limit, the command is invoked multiple times only if
the argument list would overflow ARG_MAX.  Otherwise it is invoked
only once.

The find manual says this about the two cases.

     -exec utility [argument ...] ;
             True if the program named utility returns a zero value as its
             exit status.  Optional arguments may be passed to the utility.
             The expression must be terminated by a semicolon (“;”).  If you
             invoke find from a shell you may need to quote the semicolon if
             the shell would otherwise treat it as a control operator.  If
             the string “{}” appears anywhere in the utility name or the
             arguments it is replaced by the pathname of the current file.
             Utility will be executed from the directory from which find was
             executed.  Utility and arguments are not subject to the further
             expansion of shell patterns and constructs.

     -exec utility [argument ...] {} +
             Same as -exec, except that “{}” is replaced with as many
             pathnames as possible for each invocation of utility.  This
             behaviour is similar to that of xargs(1).  The primary always
             returns true; if at least one invocation of utility returns a
             non-zero exit status, find will return a non-zero exit status.

The key difference here is this second part, "replaced with as many
pathnames as possible for each invocation of utility."  Instead of one
argument per invocation causing many invocations, it is as many
arguments as possible with each invocation.

Let's try some testing.  Let's use sleep to hold the processes alive
for a bit so we can verify with ps what is happening.

    rwp@outrage:/tmp/junk$ time find . -type f -exec sh -c 'sleep 3; echo "$@"' sh {} +
    ./one ./two ./three ./four ./five ./six ./seven ./eight ./nine ./ten

    real    0m3.032s
    user    0m0.000s
    sys     0m0.010s

After starting the above I quickly ran ps in a different terminal so I
could see the processes.

    rwp@outrage:~$ ps
     PID TT  STAT    TIME COMMAND
    8927  0  Ss   0:00.17 -bash (bash)
    9513  0  S+   0:00.00 find . -type f -exec sh -c sleep 3; echo "$@" sh {} +
    9514  0  S+   0:00.00 sh -c sleep 3; echo "$@" sh ./one ./two ./three ./four ./five ./six ./sev
    9515  0  SC+  0:00.00 sleep 3
    9379  1  Ss   0:00.02 -bash (bash)
    9516  1  R+   0:00.00 ps

Therefore "find . -exec somecommand {} +" is usually more efficient
than piping the character stream of file names to xargs using two
processes like this:

    find . -type f -print0 | xargs -r0 somecommand

That adds the I/O of a long stream of file names that must be written
out by one process and then read in by the other, which can take more
resources than avoiding the pipe.  But as with many things it all
depends upon the exact workload.

The place where xargs is definitely an improvement is when using it to
fork off many parallel processes.  If the task is long running and
needs to crunch a while, and the system has multiple cores so that the
work can be done in parallel, then that is a definite improvement.
But that adds significant complexity, and so far the examples haven't
needed it.

We can simulate that case by using sleep to idle for a few seconds.
And then we will do nothing but echo out the argument list.

    rwp@outrage:/tmp/junk$ touch one two three four five six seven eight nine ten
    rwp@outrage:/tmp/junk$ time find . -type f -print0 | xargs -r0 -P 5 -n 1 sh -c 'sleep 3;echo "$@"' sh
    ./three
    ./five
    ./four
    ./one
    ./two
    ./ten
    ./six
    ./seven
    ./nine
    ./eight

    real    0m6.157s
    user    0m0.022s
    sys     0m0.030s

Here there are ten files and I said -P 5 parallel processes.  With a
limit of 5 processes, everything can be handled in two passes.  My
command simulating doing work slept idle for 3 seconds.
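
The two-pass reasoning can be written down as a little shell
arithmetic.  This is only a back-of-the-envelope sketch: it assumes
one file per job, that every job takes the same 3 seconds, and that
process startup cost is negligible.  The variable names are made up
for illustration.

```sh
#!/bin/sh
# Rough estimate for "xargs -P 5 -n 1" over ten files.
# Assumes equal job durations and ignores startup overhead.
files=10      # number of files found
procs=5       # value given to xargs -P
per_job=3     # seconds each simulated job sleeps

# Ceiling division: rounds of at most $procs jobs needed to cover all files.
passes=$(( (files + procs - 1) / procs ))
estimate=$(( passes * per_job ))

echo "$passes passes, about $estimate seconds"
```

That prints "2 passes, about 6 seconds", which agrees with the
0m6.157s of real time measured above.
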
I told it to pass only one argument per (parallel) process using -n 1.

After starting this in one terminal, I looked at the running processes
from another terminal.

    rwp@outrage:~$ ps
     PID TT  STAT    TIME COMMAND
    8927  0  Ss   0:00.14 -bash (bash)
    9443  0  S+   0:00.00 xargs -r0 -P 5 -n 1 sh -c sleep 3;echo "$@" sh
    9444  0  S+   0:00.00 sh -c sleep 3;echo "$@" sh ./one
    9445  0  S+   0:00.00 sh -c sleep 3;echo "$@" sh ./two
    9446  0  S+   0:00.00 sh -c sleep 3;echo "$@" sh ./three
    9447  0  SC+  0:00.00 sleep 3
    9448  0  SC+  0:00.00 sleep 3
    9449  0  S+   0:00.00 sh -c sleep 3;echo "$@" sh ./four
    9450  0  S+   0:00.00 sh -c sleep 3;echo "$@" sh ./five
    9451  0  SC+  0:00.00 sleep 3
    9452  0  SC+  0:00.00 sleep 3
    9453  0  SC+  0:00.00 sleep 3
    9379  1  Ss   0:00.02 -bash (bash)
    9454  1  R+   0:00.00 ps

There are five sleep processes idling, simulating the job working.
After 3 seconds the first set of five files was echoed out.  Then,
after the second pass with another 3 seconds of idle, the second set
was echoed out.

Because I created the files in order with touch in a freshly created
directory, find listed them in that same order.  It's just a quirk of
the order in which things were created in that fresh directory.  A
busy directory would likely have scrambled things up somewhat.

    rwp@outrage:/tmp/junk$ rm -rf /tmp/junk
    rwp@outrage:/tmp/junk$ mkdir /tmp/junk
    rwp@outrage:/tmp/junk$ cd /tmp/junk
    rwp@outrage:/tmp/junk$ touch one two three four five six seven eight nine ten
    rwp@outrage:/tmp/junk$ ls -f
    ./ ../ one two three four five six seven eight nine ten

I am hoping other people are having as much fun with this thread as I
am having with it. :-)

Bob
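
For anyone who wants to check the once-per-file versus batched
behaviour on their own system, here is a minimal sketch.  It assumes a
POSIX sh and a mktemp that accepts a template argument; the scratch
directory name and the inner 'echo run' command are only illustration.
Each invocation of the inner sh prints exactly one line, so counting
lines counts invocations.

```sh
#!/bin/sh
# Illustrative sketch: count how many times find runs the utility
# for each form of -exec.  Directory and file names are assumptions.
dir=$(mktemp -d "${TMPDIR:-/tmp}/findexec.XXXXXX") || exit 1
cd "$dir" || exit 1
touch one two three four five six seven eight nine ten

# One output line per invocation of the inner sh, so wc -l counts runs.
# tr strips the padding that some wc implementations add.
per_file=$(find . -type f -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')
batched=$(find . -type f -exec sh -c 'echo run' sh {} + | wc -l | tr -d ' ')

printf '%s\n' "semicolon form ran the utility $per_file time(s)"
printf '%s\n' "plus form ran the utility $batched time(s)"

# Clean up the scratch directory.
cd / && rm -rf "$dir"
```

With ten files the semicolon form should report 10 and the plus form
should report 1, since ten short names fit easily in one argument
list.
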