Date:      Sun, 18 Oct 2020 16:24:21 -0600
From:      Bob Proulx <bob@proulx.com>
To:        freebsd-questions@freebsd.org
Subject:   Re: sh scripting question
Message-ID:  <20201018155812796938832@bob.proulx.com>
In-Reply-To: <20201018215421.6BA9B23A1462@ary.qy>
References:  <20201018144327254822114@bob.proulx.com> <20201018215421.6BA9B23A1462@ary.qy>

John Levine wrote:
> In article <20201018144327254822114@bob.proulx.com> you write:
> >> Since find is in use, I think the canonical solution is
> >> to use "find -print0"..."xargs -0"
> 
> >Here is an example, I will use a "ls -ld" command just to make it a
> >real concrete example and perhaps easier to read that way.
> >
> >    find . -exec ls -ld {} +
> 
> Sometimes that's better, sometimes not.  I have find scripts that delete
> stale files, and it is a lot faster to use xargs to run "rm" once for
> each thousand files than once per file.

Ah!  Not quite.  Let me explain.  When using "{} +" find does not
invoke the command once per file.  "{} \;" invokes the command once
per argument.  But "{} +" invokes the command with as many arguments
as will fit within ARG_MAX (whether ARG_MAX is the limiting factor is
kernel dependent).  If the argument list would overflow that limit
then the command is invoked multiple times, each time with as many
arguments as fit.  Otherwise it is invoked only once.
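
To see what that limit is on a given machine, here is a minimal
sketch.  It assumes the system provides the POSIX getconf utility and
reports ARG_MAX as a number.  Keep in mind the limit covers the
environment as well as the arguments, so the usable space is somewhat
less than the number printed.

```shell
#!/bin/sh
# Ask the system for its exec argument-size limit, in bytes.
# Assumption: getconf reports a numeric ARG_MAX on this system.
limit=$(getconf ARG_MAX)
echo "ARG_MAX on this system: $limit bytes"
```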

The find manual says this about the two cases.

     -exec utility [argument ...] ;
             True if the program named utility returns a zero value as its
             exit status.  Optional arguments may be passed to the utility.
             The expression must be terminated by a semicolon (“;”).  If you
             invoke find from a shell you may need to quote the semicolon if
             the shell would otherwise treat it as a control operator.  If the
             string “{}” appears anywhere in the utility name or the arguments
             it is replaced by the pathname of the current file.  Utility will
             be executed from the directory from which find was executed.
             Utility and arguments are not subject to the further expansion of
             shell patterns and constructs.

     -exec utility [argument ...] {} +
             Same as -exec, except that “{}” is replaced with as many
             pathnames as possible for each invocation of utility.  This
             behaviour is similar to that of xargs(1).  The primary always
             returns true; if at least one invocation of utility returns a
             non-zero exit status, find will return a non-zero exit status.

The key difference here is this second part, "replaced with as many
pathnames as possible for each invocation of utility."  Instead of one
argument per invocation causing many invocations, it is as many
arguments as possible with each invocation.
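
The difference can be counted directly.  This is only a sketch: the
scratch directory name and the "echo run" marker are made up for the
illustration.  Each invocation of the utility prints exactly one line,
so the line count is the number of invocations.

```shell
#!/bin/sh
# Hypothetical scratch directory holding ten empty files.
dir=$(mktemp -d "${TMPDIR:-/tmp}/findcount.XXXXXX") || exit 1
for name in one two three four five six seven eight nine ten; do
    : > "$dir/$name"
done

# "{} \;" runs the utility once per file; each run prints one line.
per_file=$(find "$dir" -type f -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')

# "{} +" batches the file names; each batch prints one line.
batched=$(find "$dir" -type f -exec sh -c 'echo run' sh {} + | wc -l | tr -d ' ')

echo "per-file invocations: $per_file"
echo "batched invocations:  $batched"

rm -rf "$dir"
```

With ten short names the first count should be 10 and the second 1,
since ten path names fit easily in one argument list.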

Let's try some testing.  Let's use sleep to hold the processes alive
for a bit so we can verify with ps what is happening.

    rwp@outrage:/tmp/junk$ time find . -type f -exec sh -c 'sleep 3; echo "$@"' sh {} +
    ./one ./two ./three ./four ./five ./six ./seven ./eight ./nine ./ten

    real    0m3.032s
    user    0m0.000s
    sys     0m0.010s

After starting the above I quickly ran ps in a different terminal so I
could see the processes.

    rwp@outrage:~$ ps
     PID TT  STAT    TIME COMMAND
    8927  0  Ss   0:00.17 -bash (bash)
    9513  0  S+   0:00.00 find . -type f -exec sh -c sleep 3; echo "$@" sh {} +
    9514  0  S+   0:00.00 sh -c sleep 3; echo "$@" sh ./one ./two ./three ./four ./five ./six ./sev
    9515  0  SC+  0:00.00 sleep 3
    9379  1  Ss   0:00.02 -bash (bash)
    9516  1  R+   0:00.00 ps

Therefore "find . -exec somecommand {} +" is usually more efficient
than piping the character stream of file names to xargs using two
processes like this:

    find . -type f -print0 | xargs -r0 somecommand

That adds the I/O of writing the long stream of file names out of one
process and reading it into the other.  That can cost more resources
than having find run the command directly.  But as with many things
it all depends upon the exact workload.
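
Both forms hand the same file names to the command.  A small sketch
to check that, using a made-up scratch directory with one awkward
file name containing a space, and printf standing in for the real
command:

```shell
#!/bin/sh
# Hypothetical scratch directory with a plain name and a name with a space.
dir=$(mktemp -d "${TMPDIR:-/tmp}/findcmp.XXXXXX") || exit 1
: > "$dir/plain"
: > "$dir/with space"

# find runs the command itself, batching the names.
via_exec=$(find "$dir" -type f -exec printf '%s\n' {} + | sort)

# find writes NUL-terminated names and xargs builds the command line.
via_xargs=$(find "$dir" -type f -print0 | xargs -0 printf '%s\n' | sort)

if [ "$via_exec" = "$via_xargs" ]; then
    echo "same file list"
else
    echo "lists differ"
fi

rm -rf "$dir"
```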

The place where xargs is definitely an improvement is when using xargs
to fork off many parallel processes.  If the task is a long running
task and needs to crunch a while and the system has multiple cores
such that these can be done in parallel then that is a definite
improvement.  But that adds significant complexity and so far the
examples haven't needed it.

We can simulate that case by using sleep to idle for a few seconds.
And then we will do nothing but echo out the argument list.

    rwp@outrage:/tmp/junk$ touch one two three four five six seven eight nine ten

    rwp@outrage:/tmp/junk$ time find . -type f -print0 | xargs -r0 -P 5 -n 1 sh -c 'sleep 3;echo "$@"' sh
    ./three
    ./five
    ./four
    ./one
    ./two
    ./ten
    ./six
    ./seven
    ./nine
    ./eight

    real    0m6.157s
    user    0m0.022s
    sys     0m0.030s

Here there are ten files and I asked for five parallel processes with
-P 5.  With a five process limit everything is handled in two passes.
My command simulating work slept idle for 3 seconds.  I told it to
pass only one argument per (parallel) process using -n 1.  After
starting this in one terminal I looked at the running processes from
another terminal.

    rwp@outrage:~$ ps
     PID TT  STAT    TIME COMMAND
    8927  0  Ss   0:00.14 -bash (bash)
    9443  0  S+   0:00.00 xargs -r0 -P 5 -n 1 sh -c sleep 3;echo "$@" sh
    9444  0  S+   0:00.00 sh -c sleep 3;echo "$@" sh ./one
    9445  0  S+   0:00.00 sh -c sleep 3;echo "$@" sh ./two
    9446  0  S+   0:00.00 sh -c sleep 3;echo "$@" sh ./three
    9447  0  SC+  0:00.00 sleep 3
    9448  0  SC+  0:00.00 sleep 3
    9449  0  S+   0:00.00 sh -c sleep 3;echo "$@" sh ./four
    9450  0  S+   0:00.00 sh -c sleep 3;echo "$@" sh ./five
    9451  0  SC+  0:00.00 sleep 3
    9452  0  SC+  0:00.00 sleep 3
    9453  0  SC+  0:00.00 sleep 3
    9379  1  Ss   0:00.02 -bash (bash)
    9454  1  R+   0:00.00 ps

There are five sleep processes idling, simulating the job working.
After 3 seconds the first set of five files was echo'd out.  Then
after the second pass, with another 3 seconds of idle, the second set
was echo'd out.
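
The roughly six seconds of wall time follows from simple arithmetic.
A sketch of it, with variable names of my own choosing, assuming
every job takes the same 3 seconds:

```shell
#!/bin/sh
files=10            # number of file names fed to xargs
procs=5             # the -P limit
seconds_per_job=3   # the sleep in each job

# Round up: a partly filled last pass still takes a full job time.
passes=$(( (files + procs - 1) / procs ))
total=$(( passes * seconds_per_job ))

echo "passes: $passes, expected wall time: about ${total}s"
```

That gives 2 passes and about 6 seconds, which matches the 6.157s
measured above once process startup is added in.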

Because I created the files with touch, in order, in a freshly
created directory, find listed them in that same order.  (The
parallel xargs run prints them in whatever order the processes happen
to finish.)  The find order is just a quirk of the order in which
things were created in that freshly created directory.  A busy
directory would likely have scrambled things up somewhat.
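
If a predictable order matters then it is safer not to rely on
directory order at all.  A minimal sketch, assuming the file names
contain no newlines so that a line oriented sort is safe, and using a
made-up scratch directory:

```shell
#!/bin/sh
# Hypothetical scratch directory holding the same ten file names.
dir=$(mktemp -d "${TMPDIR:-/tmp}/findsort.XXXXXX") || exit 1
for name in one two three four five six seven eight nine ten; do
    : > "$dir/$name"
done

# Sort in the C locale so the order does not depend on creation order
# or on the user's locale settings.
sorted=$(cd "$dir" && find . -type f | LC_ALL=C sort)
printf '%s\n' "$sorted"

rm -rf "$dir"
```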

    rwp@outrage:/tmp/junk$ rm -rf /tmp/junk
    rwp@outrage:/tmp/junk$ mkdir /tmp/junk
    rwp@outrage:/tmp/junk$ cd /tmp/junk
    rwp@outrage:/tmp/junk$ touch one two three four five six seven eight nine ten
    rwp@outrage:/tmp/junk$ ls -f
    ./      ../     one     two     three   four    five    six     seven   eight   nine    ten

I am hoping other people are having as much fun with this thread as I
am having with it.  :-)

Bob


