Date:      Mon, 18 Jul 2011 04:44:10 -0500 (CDT)
From:      Robert Bonomi <bonomi@mail.r-bonomi.com>
To:        f.bonnet@esiee.fr, freebsd-questions@freebsd.org
Subject:   Re: Tools to find "unlegal" files ( videos , music etc )
Message-ID:  <201107180944.p6I9iAJ9022931@mail.r-bonomi.com>
In-Reply-To: <4E23F51E.7040907@esiee.fr>

> From owner-freebsd-questions@freebsd.org  Mon Jul 18 03:55:59 2011
> Date: Mon, 18 Jul 2011 10:55:58 +0200
> From: Frank Bonnet <f.bonnet@esiee.fr>
> To: freebsd-questions@freebsd.org
> Subject: Re: Tools to find "unlegal" files ( videos , music etc )
>
> On 07/18/2011 10:45 AM, Polytropon wrote:
> > On Mon, 18 Jul 2011 10:38:22 +0200, Frank Bonnet wrote:
> >> On 07/18/2011 10:10 AM, Polytropon wrote:
> >>> On Mon, 18 Jul 2011 09:55:09 +0200, Frank Bonnet wrote:
> >>>> Hello
> >>>>
> >>>> Anyone knows an utility that I could pipe to the "find" command
> >>>> in order to detect video, music, games ... etc  files ?
> >>>>
> >>>> I need a tool that could "inspect" inside files because many users
> >>>> rename those filename to "inoffensive" ones :-)
> >>> One way could be to define a list of file extensions that
> >>> commonly matches the content you want to track. Of course,
> >>> the file name does not directly correspond to the content,
> >>> but it often gives a good hint to search for *.wmv, *.flv,
> >>> *.avi, *.mp(e)g, *.mp3, *.wma, *.exe - and of course all
> >>> the variations of the extensions with uppercase letters.
> >>> Also consider *.rar and maybe *.zip for compressed content.
> >>>
> >>> If file extensions have been manipulated (rare case), the
> >>> "file" command can still identify the correct file type.
> >>>
> >>>
> >>>
> >>>
> >> yes thanks , gonna try with the file command
> > You could make a simple script that lists "file" output for
> > all files (just to be sure because of possible suffix renaming)
> > for further inspection. Sometimes, you can also run "strings"
> > for a given file - maybe that can be used to identify typical
> > suspicious string patterns for a "strings + grep" combination
> > so less manual identification has to be done.
> >
> >
>
> yes , my main problem is the huge number of files
> but anyway I'm gonna first check files greater than 500 Mb
> it could be a good start

That's what 'find(1)' is for.  Something like (run as superuser):

 find / -exec ./inspect {} \; >> /tmp/suspects

with './inspect' being a trivial (executable!) shell-script:

    #!/bin/sh
    file "$1" | awk -f ./inspect.awk

and './inspect.awk' is:

          {file = $1 ; $1 = "";}
/regex1/  {printf("%s  %s\n", file, $0); next}
/regex2/  {printf("%s  %s\n", file, $0); next}
/regex3/  {printf("%s  %s\n", file, $0); next}
  ...      ...
  ...      ...
          {next;}

where 'regex1', 'regex2', etc. are things to select 'files' of interest,
based on what 'file' reports.  The awk code strips out the file name, so
that the regex will match only against the 'file' output, with no false
positives against a substring in the file name itself.
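
To make the regex rules a little more concrete, here is a minimal sketch of
the same idea with one rule filled in. The patterns (MPEG, AVI, Matroska,
RAR archive) are only my guesses at what 'file' prints for such content, the
function name 'filter_suspects' is made up for the example, and the two
input lines are canned text rather than real 'file' output. Check the
patterns against what 'file' reports on your own system. Note that, like
the original, it assumes path names without whitespace, because awk's $1 is
only the first whitespace-delimited field.

```shell
#!/bin/sh
# Sketch only: the patterns are illustrative assumptions about file(1)
# output, not a vetted list.
filter_suspects() {
    awk '
        # Save the "name:" field, then blank it so the patterns below
        # are matched against the description only.
        { name = $1; $1 = "" }
        /MPEG|AVI|Matroska|RAR archive/ { printf("%s  %s\n", name, $0); next }
        # Anything that matched no rule is dropped.
        { next }
    '
}

# Canned "name: description" lines standing in for real file(1) output.
# The first line has AVI only in its name and should not be reported.
printf '%s\n' \
    'AVI.txt: ASCII text' \
    'holiday.doc: RIFF (little-endian) data, AVI, 640 x 480' |
    filter_suspects
```

Only the second line should come out, since its description (not its name)
contains one of the patterns.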

See the find(1) manpage for things you can put before the '-exec' param,
to filter by size, etc.  You can also limit the search to a specific
part of the filesystem tree, by replacing '/' with the name of the directory
hierarchy you want to search -- e.g. '/home' (if that's where all 'user'
files are) -- although, 'for completeness' (given the 'legal' issues), you
may well want to run it over 'everything'.
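
As a rough sketch of the size and directory filtering mentioned above: the
helper name 'big_files' is hypothetical, and the threshold is given in
512-byte blocks because that is the unit a bare number means to find's
'-size' primary (500 MB is about 1024000 such blocks). '-type f' is my
addition to skip directories and devices.

```shell
#!/bin/sh
# Sketch only: list regular files under a directory that are larger than
# a given number of 512-byte blocks.
#   $1 = directory hierarchy to search (e.g. /home)
#   $2 = size threshold in 512-byte blocks
big_files() {
    find "$1" -type f -size +"$2" -print
}
```

Something like 'big_files /home 1024000' would then list the candidates.
To feed them straight to the inspect script instead, the same primaries go
before '-exec', e.g.
'find /home -type f -size +1024000 -exec ./inspect {} \; >> /tmp/suspects'.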




