From owner-freebsd-current Tue Jan 4 23: 2:25 2000 Delivered-To: freebsd-current@freebsd.org Received: from alcor.twinsun.com (alcor.twinsun.com [198.147.65.9]) by hub.freebsd.org (Postfix) with ESMTP id 6241A1531A; Tue, 4 Jan 2000 23:02:18 -0800 (PST) (envelope-from eggert@twinsun.com) Received: from set.twinsun.com ([192.54.239.8]) by alcor.twinsun.com (8.9.3/8.9.3) with ESMTP id XAA12071; Tue, 4 Jan 2000 23:01:10 -0800 (PST) Received: (eggert@localhost) by set.twinsun.com (8.9.3+Sun/8.9.3) id WAA10612; Tue, 4 Jan 2000 22:59:34 -0800 (PST) Date: Tue, 4 Jan 2000 22:59:34 -0800 (PST) From: Paul Eggert Message-Id: <200001050659.WAA10612@set.twinsun.com> To: ru@freebsd.org Cc: obrien@NUXI.com, alainm@rcsm.ece.mcgill.ca, freebsd-current@freebsd.org, bug-gnu-utils@gnu.org In-reply-to: <20000104143320.D10970@relay.ucb.crimea.ua> (ru@freebsd.org) Subject: Re: Bad 'grep' behaviour in -CURRENT, faulty binary detection? References: <20000104143320.D10970@relay.ucb.crimea.ua> Sender: owner-freebsd-current@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG Thanks for implementing the result of our earlier discussion. I documented what you wrote, and as a result I have the following suggestions for further improvements: * The assignment to not_text should depend on out_quiet if no --binary-files option is given. * The operands of --binary-files are confusing. E.g. --binary-files=read implies that --binary-options=summarize and --binary-options=skip don't read binary files, which isn't true. And --binary-options=skip implies that binary files aren't mentioned on output, which also isn't true (a counterexample is --files-without-match --binary-options=skip). I propose that the operands be renamed from (read, skip, summarize) to (text, without-match, binary). * There's no need for a short equivalent to the --binary-files option. * Is the -I option really needed? Can't users just use --binary-files=without-match instead? I have the impression that this option won't be typed interactively much. Could you please try the patch enclosed below instead? It implements all the above suggestions. 2000-01-04 Paul Eggert Add --binary-files option. * NEWS, doc/grep.1, doc/grep.texi: Document it. * src/grep.c (BINARY_FILES_OPTION): New constant. (long_options, grep, usage, main): New --binary-files option. (binary_files): New var. * src/system.h (TYPE_SIGNED, TYPE_MINIMUM, TYPE_MAXIMUM, CHAR_MAX): New macros. (INT_MAX, UCHAR_MAX): Define in terms of TYPE_MAXIMUM. =================================================================== RCS file: doc/grep.1,v retrieving revision 2.4 retrieving revision 2.4.0.1 diff -pu -r2.4 -r2.4.0.1 --- doc/grep.1 1999/10/12 20:41:22 2.4 +++ doc/grep.1 2000/01/05 06:35:00 2.4.0.1 @@ -28,6 +28,7 @@ grep, egrep, fgrep \- print lines matchi .IR FILE ] .RB [ \-d .IR ACTION ] +.RB [ \-\^\-binary-files=\fITYPE\fP ] .RB [ \-\^\-directories=\fIACTION\fP ] .RB [ \-\^\-extended-regexp ] .RB [ \-\^\-fixed-strings ] @@ -143,6 +144,41 @@ Print the version number of to standard error. This version number should be included in all bug reports (see below). .TP +.BI \-\^\-binary-files= TYPE +If the first few bytes of a file indicate that the file contains binary +data, assume that the file is of type +.IR TYPE . +By default, +.I TYPE +is +.BR binary , +and +.B grep +normally outputs either +a one-line message saying that a binary file matches, or no message if +there is no match. +If +.I TYPE +is +.BR without-match , +.B grep +assumes that a binary file does not match. +If +.I TYPE +is +.BR text , +.B grep +processes a binary file as if it were text; this is equivalent to the +.B \-a +or +.B \-\^\-text +option. +.I Warning: +.B "grep \-\^\-binary-files=text" +might output binary garbage, +which can have nasty side effects if the output is a terminal and if the +terminal driver interprets some of it as commands. +.TP .BR \-b ", " \-\^\-byte-offset Print the byte offset within the input file before each line of output. @@ -257,15 +293,9 @@ and and should redirect output to /dev/null instead. .TP .BR \-a ", " \-\^\-text -Do not suppress output lines that contain binary data. -Normally, if the first few bytes of a file indicate that -the file contains binary data, -.B grep -outputs only a message saying that the file matches the pattern. -This option causes -.B grep -to act as if the file is a text file, -even if it would otherwise be treated as binary. +Process a binary file as if it were text; this is equivalent to the +.B \-\^\-binary-files=text +option. .TP .BR \-v ", " \-\^\-invert-match Invert the sense of matching, to select non-matching lines. @@ -323,9 +353,9 @@ system call to read input, instead of the default .BR read (2) system call. In some situations, -.B -\^-mmap +.B \-\^\-mmap yields better performance. However, -.B -\^-mmap +.B \-\^\-mmap can cause undefined behavior (including core dumps) if an input file shrinks while .B grep =================================================================== RCS file: doc/grep.texi,v retrieving revision 2.4 retrieving revision 2.4.0.1 diff -pu -r2.4 -r2.4.0.1 --- doc/grep.texi 1999/11/13 21:20:24 2.4 +++ doc/grep.texi 2000/01/05 06:35:00 2.4.0.1 @@ -275,6 +275,21 @@ This version number should be included i Print a usage message briefly summarizing these command-line options and the bug-reporting address, then exit. +@itemx --binary-files=@var{type} +@opindex --binary-files +@cindex binary files +If the first few bytes of a file indicate that the file contains binary +data, assume that the file is of type @var{type}. By default, +@var{type} is @samp{binary}, and @command{grep} normally outputs either +a one-line message saying that a binary file matches, or no message if +there is no match. If @var{type} is @samp{without-match}, +@command{grep} assumes that a binary file does not match. If @var{type} +is @samp{text}, @command{grep} processes a binary file as if it were +text; this is equivalent to the @samp{-a} or @samp{--text} option. +@emph{Warning:} @samp{--binary-files=text} might output binary garbage, +which can have nasty side effects if the output is a terminal and if the +terminal driver interprets some of it as commands. + @item -b @itemx --byte-offset @opindex -b @@ -329,16 +344,8 @@ The scanning of every file will stop on @opindex --text @cindex suppress binary data @cindex binary files -Do not suppress output lines that contain binary data. -Normally, if the first few bytes of a file indicate -that the file contains binary data, grep outputs only a -message saying that the file matches the pattern. This -option causes grep to act as if the file is a text -file, even if it would otherwise be treated as binary. -@emph{Warning:} the result might be binary garbage -printed to the terminal, which can have nasty -side-effects if the terminal driver interprets some of -it as commands. +Process a binary file as if it were text; this is equivalent to the +@samp{--binary-files=text} option. @item -w @itemx --word-regexp =================================================================== RCS file: RCS/NEWS,v retrieving revision 2.4 retrieving revision 2.4.0.1 diff -pu -r2.4 -r2.4.0.1 --- NEWS 1999/11/08 00:42:29 2.4 +++ NEWS 2000/01/05 06:35:00 2.4.0.1 @@ -1,3 +1,10 @@ + - The new option --binary-files=TYPE makes grep assume that a binary input + file is of type TYPE. + --binary-files='binary' (the default) outputs a 1-line summary of matches. + --binary-files='without-match' assumes binary files do not match. + --binary-files='text' treats binary files as text + (equivalent to -a or --text). + Version 2.4: - egrep is now equivalent to `grep -E' as required by POSIX, =================================================================== RCS file: src/grep.c,v retrieving revision 2.4 retrieving revision 2.4.0.1 diff -pu -r2.4 -r2.4.0.1 --- src/grep.c 1999/11/13 16:31:55 2.4 +++ src/grep.c 2000/01/05 06:35:00 2.4.0.1 @@ -62,12 +62,19 @@ static int mmap_option; static char const short_options[] = "0123456789A:B:C::EFGHUVX:abcd:e:f:hiLlnqrsuvwxyZz"; +/* Non-boolean long options that have no corresponding short equivalents. */ +enum +{ + BINARY_FILES_OPTION = CHAR_MAX + 1 +}; + /* Long options equivalences. */ static struct option long_options[] = { {"after-context", required_argument, NULL, 'A'}, {"basic-regexp", no_argument, NULL, 'G'}, {"before-context", required_argument, NULL, 'B'}, + {"binary-files", required_argument, NULL, BINARY_FILES_OPTION}, {"byte-offset", no_argument, NULL, 'b'}, {"context", optional_argument, NULL, 'C'}, {"count", no_argument, NULL, 'c'}, @@ -476,7 +483,12 @@ fillbuf (save, stats) } /* Flags controlling the style of output. */ -static int always_text; /* Assume the input is always text. */ +static enum + { + BINARY_BINARY_FILES, + TEXT_BINARY_FILES, + WITHOUT_MATCH_BINARY_FILES + } binary_files; /* How to handle binary files. */ static int filename_mask; /* If zero, output nulls after filenames. */ static int out_quiet; /* Suppress all normal output. */ static int out_invert; /* Print nonmatching stuff. */ @@ -729,11 +741,14 @@ grep (fd, file, stats) { if (! (is_EISDIR (errno, file) && suppress_errors)) error (filename, errno); - return nlines; + return 0; } - not_text = (! (always_text | out_quiet) + not_text = (((binary_files == BINARY_BINARY_FILES && !out_quiet) + || binary_files == WITHOUT_MATCH_BINARY_FILES) && memchr (bufbeg, eol ? '\0' : '\200', buflim - bufbeg)); + if (not_text && binary_files == WITHOUT_MATCH_BINARY_FILES) + return 0; done_on_match += not_text; out_quiet += not_text; @@ -993,7 +1008,9 @@ Output control:\n\ -H, --with-filename print the filename for each match\n\ -h, --no-filename suppress the prefixing filename on output\n\ -q, --quiet, --silent suppress all normal output\n\ - -a, --text do not suppress binary output\n\ + -a, --text equivalent to --binary-files=text\n\ + --binary-files=TYPE assume that binary files are TYPE\n\ + TYPE is 'binary', 'text', or 'without-match'.\n\ -d, --directories=ACTION how to handle directories\n\ ACTION is 'read', 'recurse', or 'skip'.\n\ -r, --recursive equivalent to --directories=recurse.\n\ @@ -1276,7 +1293,7 @@ main (argc, argv) setmatcher (optarg); break; case 'a': - always_text = 1; + binary_files = TEXT_BINARY_FILES; break; case 'b': out_byte = 1; @@ -1370,6 +1387,16 @@ main (argc, argv) case 'z': eolbyte = '\0'; break; + case BINARY_FILES_OPTION: + if (strcmp (optarg, "binary") == 0) + binary_files = BINARY_BINARY_FILES; + else if (strcmp (optarg, "text") == 0) + binary_files = TEXT_BINARY_FILES; + else if (strcmp (optarg, "without-match") == 0) + binary_files = WITHOUT_MATCH_BINARY_FILES; + else + fatal (_("unknown binary-files type"), 0); + break; case 0: /* long options */ break; =================================================================== RCS file: src/system.h,v retrieving revision 2.4 retrieving revision 2.4.0.1 diff -pu -r2.4 -r2.4.0.1 --- src/system.h 1999/10/12 20:41:25 2.4 +++ src/system.h 2000/01/05 06:35:00 2.4.0.1 @@ -127,11 +127,20 @@ void free(); #ifndef CHAR_BIT # define CHAR_BIT 8 #endif +/* The extra casts work around common compiler bugs. */ +#define TYPE_SIGNED(t) (! ((t) 0 < (t) -1)) +#define TYPE_MINIMUM(t) ((t) (TYPE_SIGNED (t) \ + ? ~ (t) 0 << (sizeof (t) * CHAR_BIT - 1) \ + : (t) 0)) +#define TYPE_MAXIMUM(t) ((t) (~ (t) 0 - TYPE_MINIMUM (t))) +#ifndef CHAR_MAX +# define CHAR_MAX TYPE_MAXIMUM (char) +#endif #ifndef INT_MAX -# define INT_MAX 2147483647 +# define INT_MAX TYPE_MAXIMUM (int) #endif #ifndef UCHAR_MAX -# define UCHAR_MAX 255 +# define UCHAR_MAX TYPE_MAXIMUM (unsigned char) #endif #if !defined(STDC_HEADERS) && defined(HAVE_STRING_H) && defined(HAVE_MEMORY_H) To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-current" in the body of the message