From owner-freebsd-questions@FreeBSD.ORG Mon Jul 18 09:44:39 2011
Return-Path:
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
    by hub.freebsd.org (Postfix) with ESMTP id D569E106564A
    for ; Mon, 18 Jul 2011 09:44:39 +0000 (UTC)
    (envelope-from bonomi@mail.r-bonomi.com)
Received: from mail.r-bonomi.com (mx-out.r-bonomi.com [204.87.227.120])
    by mx1.freebsd.org (Postfix) with ESMTP id 7B8A88FC08
    for ; Mon, 18 Jul 2011 09:44:38 +0000 (UTC)
Received: (from bonomi@localhost)
    by mail.r-bonomi.com (8.14.4/rdb1) id p6I9iAJ9022931;
    Mon, 18 Jul 2011 04:44:10 -0500 (CDT)
Date: Mon, 18 Jul 2011 04:44:10 -0500 (CDT)
From: Robert Bonomi
Message-Id: <201107180944.p6I9iAJ9022931@mail.r-bonomi.com>
To: f.bonnet@esiee.fr, freebsd-questions@freebsd.org
In-Reply-To: <4E23F51E.7040907@esiee.fr>
Cc:
Subject: Re: Tools to find "unlegal" files ( videos , music etc )
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: User questions
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 18 Jul 2011 09:44:39 -0000

> From owner-freebsd-questions@freebsd.org Mon Jul 18 03:55:59 2011
> Date: Mon, 18 Jul 2011 10:55:58 +0200
> From: Frank Bonnet
> To: freebsd-questions@freebsd.org
> Subject: Re: Tools to find "unlegal" files ( videos , music etc )
>
> On 07/18/2011 10:45 AM, Polytropon wrote:
> > On Mon, 18 Jul 2011 10:38:22 +0200, Frank Bonnet wrote:
> >> On 07/18/2011 10:10 AM, Polytropon wrote:
> >>> On Mon, 18 Jul 2011 09:55:09 +0200, Frank Bonnet wrote:
> >>>> Hello
> >>>>
> >>>> Anyone knows an utility that I could pipe to the "find" command
> >>>> in order to detect video, music, games ... etc files ?
> >>>>
> >>>> I need a tool that could "inspect" inside files because many users
> >>>> rename those filename to "inoffensive" ones :-)
> >>> One way could be to define a list of file extensions that
> >>> commonly matches the content you want to track. Of course,
> >>> the file name does not directly correspond to the content,
> >>> but it often gives a good hint to search for *.wmv, *.flv,
> >>> *.avi, *.mp(e)g, *.mp3, *.wma, *.exe - and of course all
> >>> the variations of the extensions with uppercase letters.
> >>> Also consider *.rar and maybe *.zip for compressed content.
> >>>
> >>> If file extensions have been manipulated (rare case), the
> >>> "file" command can still identify the correct file type.
> >>>
> >> yes thanks , gonna try with the file command
> > You could make a simple script that lists "file" output for
> > all files (just to be sure because of possible suffix renaming)
> > for further inspection. Sometimes, you can also run "strings"
> > for a given file - maybe that can be used to identify typical
> > suspicious string patterns for a "strings + grep" combination
> > so less manual identification has to be done.
> >
>
> yes , my main problem is the huge number of files
> but anyway I'm gonna first check files greater than 500 Mb
> it could be a good start

That's what 'find(1)' is for.

Something like (run as superuser):

    find / -exec ./inspect {} \; >> /tmp/suspects

with './inspect' being a trivial (executable!) shell-script:

    #!/bin/sh
    file "$1" | awk -f ./inspect.awk

and './inspect.awk' is:

    {file = $1 ; $1 = "";}
    /regex1/ {printf("%s %s\n",file,$0);next;}
    /regex2/ {printf("%s %s\n",file,$0);next;}
    /regex3/ {printf("%s %s\n",file,$0);next;}
      ...       ...
      ...       ...
    {next;}

where 'regex1', 'regex2', etc. are things to select 'files' of interest,
based on what 'file' reports.
The awk code strips out the file name, so that the regex will match
only against the 'file' output, with no false positives against a
substring in the file name itself.

See the find(1) manpage for things you can put before the '-exec' param,
to filter by size, etc.

You can also limit the search to a specific part of the filesystem tree,
by replacing '/' with the name of the directory hierarchy you want to
search -- e.g. '/home' (if that's where all 'user' files are) -- although,
'for completeness' (given the 'legal' issues) you may well want to run it
over 'everything'.
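
For what it's worth, here is a rough sketch of how the size filter and the
content check might be combined in one small function. It is only an
illustration, not a tested tool: the name 'scan_tree' is made up, it assumes
your file(1) understands '-b' and '--mime-type', the 'audio/' and 'video/'
prefixes stand in for whatever patterns you actually care about, and paths
containing newlines are not handled.

```sh
#!/bin/sh
# scan_tree DIR MIN_BLOCKS   (hypothetical helper, for illustration only)
# Prints "mime-type<TAB>path" for every regular file under DIR that uses
# more than MIN_BLOCKS 512-byte blocks and whose content, as judged by
# file(1) rather than by its name, looks like audio or video.
scan_tree() {
    find "$1" -type f -size +"$2" -print |
    while IFS= read -r path; do
        # -b drops the file name from the output, so nothing in the
        # name itself can cause a false match.
        mime=$(file -b --mime-type "$path")
        case $mime in
            audio/*|video/*) printf '%s\t%s\n' "$mime" "$path" ;;
        esac
    done
}

# Example: everything under /home bigger than 500 MiB
# (1024000 blocks * 512 bytes); append the result to a report:
#   scan_tree /home 1024000 >> /tmp/suspects
```

A bare number after '-size' is counted in 512-byte blocks, which is why
500 MiB is written as 1024000 here; many find implementations also accept a
suffixed form such as '+500M', but check your find(1) manpage first.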