Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 30 Jun 2011 18:33:34 +0200
From:      Stefan Esser <se@freebsd.org>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        Alexander Best <arundel@freebsd.org>, Poul-Henning Kamp <phk@phk.freebsd.dk>, standards@freebsd.org, Bruce Evans <bde@freebsd.org>
Subject:   RESENT with patch ... Re: [RFC] Consistent numeric range for "expr" on all architectures
Message-ID:  <4E0CA55E.2090102@freebsd.org>
In-Reply-To: <4E0C2F41.2060302@freebsd.org>
References:  <99048.1309258976@critter.freebsd.dk> <4E0A0774.3090004@freebsd.org> <20110629082103.O1084@besplex.bde.org> <4E0B1C47.4010201@freebsd.org> <20110630073705.P1117@besplex.bde.org> <4E0C2F41.2060302@freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
This is a multi-part message in MIME format.
--------------070607050902030603000606
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

The attached patch takes the feedback from Bruce into account.
See details inline in the quoted mail, below.

Any further comments?

On 30.06.2011 10:09, Stefan Esser wrote:
> On 30.06.2011 01:32, Bruce Evans wrote:
>> On Wed, 29 Jun 2011, Stefan Esser wrote:
>>> Am 29.06.2011 01:06, schrieb Bruce Evans:
>>>> Other points:
>>>> - `expr -e 10000000000000000000 + 0' (19 zeros) gives "Result too large",
>>>>   but it isn't the result that is too large, but the arg that is too large.
>>>>   This message is strerror(ERANGE) after strtoimax() sets errno to ERANGE.
>>>>   `expr -e 1000000000000000000 \* 10' gives "overflow".  This message is
>>>>   correct, but it is in a different style to strerror() (uncapitalized,
>>>>   and more concise).
>>>
>>> The patch that I sent with my first message fixes this. The message is
>>> changed from "Result too large" to "Not a valid integer: %s".
>>>
>>> ("non-numeric argument" is used in other places and I could adapt to
>>> that, though I prefer to also see which argument was rejected. But I
>>> think that "not a valid integer" better describes the situation.)
>>
>> I prefer "operand too large".  A decimal integer operand that is too
>> large to be represented by intmax_t is not really invalid, but is just
>> too large.  We already have a different message for invalid.
> 
> Yes, I also thought the "non-numeric" message was misleading and that
> "not a valid integer" was just slightly less misleading ;-)
> 
> But "operand too large" correctly describes the situation and I'll just
> add a test for that case (instead of just relying on strtoimax() to decide
> about integer-ness).

The patch introduces a function is_integer(), which performs a simple
syntax check on operands to determine whether they are decimals. This
function replaces the use of strtoimax() for the purpose of testing the
operand, which fails to accept decimals too long to be represented by
intmax_t.

> According to POSIX, "+1" is no valid integer operand in "expr", should a
> leading "+" also be detected and rejected?

The is_integer() function only permits an optional leading minus sign
followed by digits. The case of a only a minus sign passed as operand is
already caught by the parser.
(Aside: Is it correct to reject "expr -- - : -" as a syntax error?)
>>> From an old (?) version of the man page:
>>
>> %    Arithmetic operations are performed using signed integer math.  If the -e
>> %    flag is specified, arithmetic uses the C intmax_t data type (the largest
>>                         ^^^^^^^^^^ actual arithmentic, not just parsing; but
>>                                    for expr [-e] 10000000000000000000 we are
>>                                    doing actual arithmetic -- see below 
>> %    integral type available), and expr will detect arithmetic overflow and
>> %    return an error indication.  If a numeric operand is specified which is
>>
>> "Numeric" is not defined anywhere in the man page.  This is the only
>> use of it in the man page.  It means "decimal integer" and should say
>> precisely that.  The only hint about this in the man page is the statement
>> that "all integer operands are interpreted in base 10".  The fuzziness
>> extends to error messages saying "non-numeric argument" instead of
>> "operand not a decimal integer [used in integer context]".
> 
> Ok, all numeric arguments should be called "decimal integer" in the
> documentation and error messages (if applicable).

I have changed the wording in the man page. Critical review is welcome.

>> %    so large as to overflow conversion to an integer, it is parsed as a
>> %    string instead.  If -e is not specified, arithmetic operations and pars-
>>
>> This specificially says that large operands are parsed as strings.
>> Strangely, since large operands are only checked for with -e, only -e can
>> get this right; without -e, large operands are not even detected.
>> However, this is a bug in the man page -- see below.
> 
> I had already wondered about this case (large numbers being considered
> strings), but I thought it was in compliance with POSIX. This will be
> fixed by checking operands to be POSIX decimal integers.

Both program and man page are fixed.

[...]
>>   Therefore, "expr -e 1000000000000000000" is not a syntax error as seems
>>   to be required by the man page; the arg in it forms a degenerate
>>   expression.  The arg is not a <string> since it is an <integer>.
>>   Therefore, the expression is numeric (I didn't check that POSIX says
>>   this explicitly).  Therefore, we are justified in applying strtoimax()
>>   to all the operands in the expression (all 1 of them) and getting a
>>   range error.

The current behaviour was irritating in so far as the first the operand
was printed, then an error message on the next line, and an exit code of
2 is returned. The error message comes from the check for 0 or "", which
shall result in an exit code of 1, but for too large decimals an
overflow is detected during these tests.

This is fixed by first checking for overflow and then printing the
result of the evaluated expression.

A side effect is that "expr 0000000000010000000" now correctly prints
"10000000" instead of "0000000000010000000".

> I think this is rational behaviour of the code, but I'm not convinced,
> that <identifier> can not be <integer> and <string> at the same time.
> See for example:
> 
> $ expr \( 222 \) : "2*"
> 3
> $ expr \( 111 + 111 \) : "2*"
> 3
> 
> The numeric result is taken as string to perform the RE matching and the
> length of the first match is returned. Therefore, a clear integer can
> still be interpreted as string, if operand to a string expression.
> The degenerate case of just a too long integer might still be considered
> a valid string and the missing operation (or identity?) does not change
> that type. Therefore it might be argued, that a very long numeric
> argument should just be printed as string in that case. (???)

I have had a look at the POSIX page for EXRP and found that I had a
wrong understanding of some terms. Each operand is either an integer or
a string, but the result of an evaluation can be converted to either.

Thus it is correct to enforce numeric interpretation on input that is
syntactically parsed as integer number.

>> The syntax is still broken as designed, since it doesn't allow +1 to
>> be an integer, and it requires octal intgers to be misinterpreted
>> as decimal integers although no reasonable specification of decimal
>> integers allows them to start with a '0', and it doesn't support
>> non-decimal integers...
> 
> We *do* accept "+1" as synonymous to "1", though, as pointed out above.
> I'd like to change this.

Changed, as explained above.

>>> Does POSIX require that expr exits on illegal arguments?
>>
>> Not sure.  It requires an exit status of 2 for invalid expressions.
>> For the "+" operator, the operands are required to be decimal
>> integers, but the error handling isn't so clearly specified.  For
>> the "&" operator, operands(s) are allowed to be null.

The patch I prepared supports empty operands for arithmetic operations,
they are treated as "0":

$ expr "" + 1
1

This is easily changed by enforcing at least one digit in is_integer().
(I have tested both versions, but since this appears to be traditional
behaviour at least on FreeBSD, I'm worried that enforcing at least one
digit for numeric operands will break existing shell scripts.)

Please review the attached patch. It is again created with "-w" to
ignore whitespace changes (indentation).

Regards, STefan

--------------070607050902030603000606
Content-Type: text/plain;
 name="expr.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="expr.diff"

--- expr.y~
+++ expr.y	2011-06-30 17:44:35.914960240 +0200
@@ -48,6 +48,7 @@
 int		chk_times(intmax_t, intmax_t, intmax_t);
 void		free_value(struct val *);
 int		is_zero_or_null(struct val *);
+int		is_integer(const char *);
 int		isstring(struct val *);
 struct val	*make_integer(intmax_t);
 struct val	*make_str(const char *);
@@ -65,13 +66,12 @@
 struct val	*op_plus(struct val *, struct val *);
 struct val	*op_rem(struct val *, struct val *);
 struct val	*op_times(struct val *, struct val *);
-intmax_t	to_integer(struct val *);
+int		to_integer(struct val *);
 void		to_string(struct val *);
 int		yyerror(const char *);
 int		yylex(void);
 int		yyparse(void);
 
-static int	eflag;
 char **av;
 %}
 
@@ -134,37 +134,15 @@
 make_str(const char *s)
 {
 	struct val *vp;
-	char *ep;
 
 	vp = (struct val *) malloc (sizeof (*vp));
 	if (vp == NULL || ((vp->u.s = strdup (s)) == NULL)) {
 		errx(ERR_EXIT, "malloc() failed");
 	}
-
-	/*
-	 * Previously we tried to scan the string to see if it ``looked like''
-	 * an integer (erroneously, as it happened).  Let strtoimax() do the
-	 * dirty work.  We could cache the value, except that we are using
-	 * a union and need to preserve the original string form until we
-	 * are certain that it is not needed.
-	 *
-	 * IEEE Std.1003.1-2001 says:
-	 * /integer/ An argument consisting only of an (optional) unary minus  
-	 *	     followed by digits.          
-	 *
-	 * This means that arguments which consist of digits followed by
-	 * non-digits MUST NOT be considered integers.  strtoimax() will
-	 * figure this out for us.
-	 */
-	if (eflag)
-		(void)strtoimax(s, &ep, 10);
+	if (is_integer(s))
+		vp->type = numeric_string;
 	else
-		(void)strtol(s, &ep, 10);
-
-	if (*ep != '\0')
 		vp->type = string;
-	else	
-		vp->type = numeric_string;
 
 	return vp;
 }
@@ -178,7 +156,7 @@
 }
 
 
-intmax_t
+int
 to_integer(struct val *vp)
 {
 	intmax_t i;
@@ -191,13 +169,9 @@
 
 	/* vp->type == numeric_string, make it numeric */
 	errno = 0;
-	if (eflag) {
-		i  = strtoimax(vp->u.s, (char **)NULL, 10);
-		if (errno == ERANGE)
-			err(ERR_EXIT, NULL);
-	} else {
-		i = strtol(vp->u.s, (char **)NULL, 10);
-	}
+	i  = strtoimax(vp->u.s, (char **)NULL, 10);
+	if (errno == ERANGE)
+		errx(ERR_EXIT, "operand too large: '%s'", vp->u.s);
 
 	free (vp->u.s);
 	vp->u.i = i;
@@ -230,6 +204,17 @@
 
 
 int
+is_integer(const char *s)
+{
+	if (*s == '-')
+		s++;
+	while (isdigit(*s))
+		s++;
+	return *s == '\0';
+}
+
+
+int
 isstring(struct val *vp)
 {
 	/* only TRUE if this string is not a valid integer */
@@ -282,12 +267,11 @@
 	if (getenv("EXPR_COMPAT") != NULL
 	    || check_utility_compat("expr")) {
 		av = argv + 1;
-		eflag = 1;
 	} else {
 		while ((c = getopt(argc, argv, "e")) != -1)
 			switch (c) {
 			case 'e':
-				eflag = 1;
+				/* ignore for backwards compatibility */
 				break;
 
 			default:
@@ -300,6 +284,8 @@
 
 	yyparse();
 
+	/* convert type numeric_string to integer; NOP for other types */
+	(void)to_integer(result);
 	if (result->type == integer)
 		printf("%jd\n", result->u.i);
 	else
@@ -483,13 +469,10 @@
 		errx(ERR_EXIT, "non-numeric argument");
 	}
 
-	if (eflag) {
-		r = make_integer(a->u.i + b->u.i);
-		if (chk_plus(a->u.i, b->u.i, r->u.i)) {
-			errx(ERR_EXIT, "overflow");
-		}
-	} else
-		r = make_integer((long)a->u.i + (long)b->u.i);
+	r = make_integer(a->u.i + b->u.i);
+	if (chk_plus(a->u.i, b->u.i, r->u.i)) {
+		errx(ERR_EXIT, "overflow");
+	}
 
 	free_value (a);
 	free_value (b);
@@ -520,13 +503,10 @@
 		errx(ERR_EXIT, "non-numeric argument");
 	}
 
-	if (eflag) {
-		r = make_integer(a->u.i - b->u.i);
-		if (chk_minus(a->u.i, b->u.i, r->u.i)) {
-			errx(ERR_EXIT, "overflow");
-		}
-	} else
-		r = make_integer((long)a->u.i - (long)b->u.i);
+	r = make_integer(a->u.i - b->u.i);
+	if (chk_minus(a->u.i, b->u.i, r->u.i)) {
+		errx(ERR_EXIT, "overflow");
+	}
 
 	free_value (a);
 	free_value (b);
@@ -554,13 +534,10 @@
 		errx(ERR_EXIT, "non-numeric argument");
 	}
 
-	if (eflag) {
-		r = make_integer(a->u.i * b->u.i);
-		if (chk_times(a->u.i, b->u.i, r->u.i)) {
-			errx(ERR_EXIT, "overflow");
-		}
-	} else
-		r = make_integer((long)a->u.i * (long)b->u.i);
+	r = make_integer(a->u.i * b->u.i);
+	if (chk_times(a->u.i, b->u.i, r->u.i)) {
+		errx(ERR_EXIT, "overflow");
+	}
 
 	free_value (a);
 	free_value (b);
@@ -591,13 +568,10 @@
 		errx(ERR_EXIT, "division by zero");
 	}
 
-	if (eflag) {
-		r = make_integer(a->u.i / b->u.i);
-		if (chk_div(a->u.i, b->u.i)) {
-			errx(ERR_EXIT, "overflow");
-		}
-	} else
-		r = make_integer((long)a->u.i / (long)b->u.i);
+	r = make_integer(a->u.i / b->u.i);
+	if (chk_div(a->u.i, b->u.i)) {
+		errx(ERR_EXIT, "overflow");
+	}
 
 	free_value (a);
 	free_value (b);
@@ -617,11 +591,8 @@
 		errx(ERR_EXIT, "division by zero");
 	}
 
-	if (eflag)
-		r = make_integer(a->u.i % b->u.i);
-	        /* chk_rem necessary ??? */
-	else
-		r = make_integer((long)a->u.i % (long)b->u.i);
+	r = make_integer(a->u.i % b->u.i);
+	/* chk_rem necessary ??? */
 
 	free_value (a);
 	free_value (b);
--- expr.1~
+++ expr.1	2011-06-30 16:06:50.338086000 +0200
@@ -50,25 +50,19 @@
 All operators and operands must be passed as separate arguments.
 Several of the operators have special meaning to command interpreters
 and must therefore be quoted appropriately.
-All integer operands are interpreted in base 10.
+All integer operands are interpreted in base 10 and must only consist
+of an optional leading minus sign followed by one or more digits.
 .Pp
-Arithmetic operations are performed using signed integer math.
-If the
-.Fl e
-flag is specified, arithmetic uses the C
+Arithmetic operations are performed using signed integer math with a
+range according to the C
 .Vt intmax_t
-data type (the largest integral type available), and
-.Nm
-will detect arithmetic overflow and return an error indication.
-If a numeric operand is specified which is so large as to overflow
-conversion to an integer, it is parsed as a string instead.
-If
+data type (the largest signed integral type available). All conversions and
+operations are checked for overflow, which will result in program termination
+with an error message on stdout and with error status being returned.
+.Pp
+The
 .Fl e
-is not specified, arithmetic operations and parsing of integer
-arguments will overflow silently according to the rules of the C
-standard, using the
-.Vt long
-data type.
+option is accepted for backwards compatibility but has no effect.
 .Pp
 Operators are listed below in order of increasing precedence; all
 are left-associative.
@@ -272,6 +266,8 @@
 utility conforms to
 .St -p1003.1-2001 ,
 provided that compatibility mode is not enabled.
+Extended arithmetic range and overflow checks apply to cases of
+undefined behavior according to POSIX.
 The
 .Fl e
 flag is an extension.

--------------070607050902030603000606--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4E0CA55E.2090102>