From: Jilles Tjoelker
Date: Sun, 8 May 2011 17:40:10 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: svn commit: r221669 - in head: bin/sh tools/regression/bin/sh/parser
Message-Id: <201105081740.p48HeAoK056129@svn.freebsd.org>

Author: jilles
Date: Sun May 8 17:40:10 2011
New Revision: 221669
URL: http://svn.freebsd.org/changeset/base/221669

Log:
  sh: Add \u/\U support (in $'...') for UTF-8.

  Because we have no iconv in base, support for other charsets is not
  possible.

  Note that \u/\U are processed using the locale that was active when the
  shell started. This is necessary to avoid behaviour that depends on the
  parse/execute split (for example when placing braces around an entire
  script).
  Therefore, UTF-8 encoding is implemented manually.

Added:
  head/tools/regression/bin/sh/parser/dollar-quote10.0   (contents, props changed)
  head/tools/regression/bin/sh/parser/dollar-quote11.0   (contents, props changed)
Modified:
  head/bin/sh/main.c
  head/bin/sh/parser.c
  head/bin/sh/sh.1
  head/bin/sh/var.c
  head/bin/sh/var.h

Modified: head/bin/sh/main.c
==============================================================================
--- head/bin/sh/main.c	Sun May 8 16:15:50 2011	(r221668)
+++ head/bin/sh/main.c	Sun May 8 17:40:10 2011	(r221669)
@@ -76,7 +76,7 @@ __FBSDID("$FreeBSD$");
 int rootpid;
 int rootshell;
 struct jmploc main_handler;
-int localeisutf8;
+int localeisutf8, initial_localeisutf8;
 
 static void read_profile(const char *);
 static char *find_dot_file(char *);
@@ -97,7 +97,7 @@ main(int argc, char *argv[])
 	char *shinit;
 
 	(void) setlocale(LC_ALL, "");
-	updatecharset();
+	initcharset();
 	state = 0;
 	if (setjmp(main_handler.loc)) {
 		switch (exception) {

Modified: head/bin/sh/parser.c
==============================================================================
--- head/bin/sh/parser.c	Sun May 8 16:15:50 2011	(r221668)
+++ head/bin/sh/parser.c	Sun May 8 17:40:10 2011	(r221669)
@@ -1219,6 +1219,29 @@ readcstyleesc(char *out)
 		if (v == 0 || (v >= 0xd800 && v <= 0xdfff))
 			synerror("Bad escape sequence");
 		/* We really need iconv here. */
+		if (initial_localeisutf8 && v > 127) {
+			CHECKSTRSPACE(4, out);
+			/*
+			 * We cannot use wctomb() as the locale may have
+			 * changed.
+			 */
+			if (v <= 0x7ff) {
+				USTPUTC(0xc0 | v >> 6, out);
+				USTPUTC(0x80 | (v & 0x3f), out);
+				return out;
+			} else if (v <= 0xffff) {
+				USTPUTC(0xe0 | v >> 12, out);
+				USTPUTC(0x80 | ((v >> 6) & 0x3f), out);
+				USTPUTC(0x80 | (v & 0x3f), out);
+				return out;
+			} else if (v <= 0x10ffff) {
+				USTPUTC(0xf0 | v >> 18, out);
+				USTPUTC(0x80 | ((v >> 12) & 0x3f), out);
+				USTPUTC(0x80 | ((v >> 6) & 0x3f), out);
+				USTPUTC(0x80 | (v & 0x3f), out);
+				return out;
+			}
+		}
 		if (v > 127)
 			v = '?';
 		break;

Modified: head/bin/sh/sh.1
==============================================================================
--- head/bin/sh/sh.1	Sun May 8 16:15:50 2011	(r221668)
+++ head/bin/sh/sh.1	Sun May 8 17:40:10 2011	(r221669)
@@ -463,8 +463,8 @@ The Unicode code point
 (eight hexadecimal digits)
 .El
 .Pp
-The sequences for Unicode code points currently only provide useful results
-for values below 128.
+The sequences for Unicode code points are currently only useful with
+UTF-8 locales.
 They reject code point 0 and UTF-16 surrogates.
 .Pp
 If an escape sequence would produce a byte with value 0,

Modified: head/bin/sh/var.c
==============================================================================
--- head/bin/sh/var.c	Sun May 8 16:15:50 2011	(r221668)
+++ head/bin/sh/var.c	Sun May 8 17:40:10 2011	(r221669)
@@ -517,6 +517,13 @@ updatecharset(void)
 	localeisutf8 = !strcmp(charset, "UTF-8");
 }
 
+void
+initcharset(void)
+{
+	updatecharset();
+	initial_localeisutf8 = localeisutf8;
+}
+
 /*
  * Generate a list of exported variables.  This routine is used to construct
  * the third argument to execve when executing a program.

Modified: head/bin/sh/var.h
==============================================================================
--- head/bin/sh/var.h	Sun May 8 16:15:50 2011	(r221668)
+++ head/bin/sh/var.h	Sun May 8 17:40:10 2011	(r221669)
@@ -83,6 +83,8 @@ extern struct var vterm;
 #endif
 
 extern int localeisutf8;
+/* The parser uses the locale that was in effect at startup. */
+extern int initial_localeisutf8;
 
 /*
  * The following macros access the values of the above variables.
@@ -116,6 +118,7 @@ char *bltinlookup(const char *, int);
 void bltinsetlocale(void);
 void bltinunsetlocale(void);
 void updatecharset(void);
+void initcharset(void);
 char **environment(void);
 int showvarscmd(int, char **);
 int exportcmd(int, char **);

Added: head/tools/regression/bin/sh/parser/dollar-quote10.0
==============================================================================
--- /dev/null	00:00:00 1970	(empty, because file is newly added)
+++ head/tools/regression/bin/sh/parser/dollar-quote10.0	Sun May 8 17:40:10 2011	(r221669)
@@ -0,0 +1,10 @@
+# $FreeBSD$
+
+# a umlaut
+s=$(printf '\303\244')
+# euro sign
+s=$s$(printf '\342\202\254')
+
+# Start a new shell so the locale change is picked up.
+ss="$(LC_ALL=en_US.UTF-8 ${SH} -c "printf %s \$'\u00e4\u20ac'")"
+[ "$s" = "$ss" ]

Added: head/tools/regression/bin/sh/parser/dollar-quote11.0
==============================================================================
--- /dev/null	00:00:00 1970	(empty, because file is newly added)
+++ head/tools/regression/bin/sh/parser/dollar-quote11.0	Sun May 8 17:40:10 2011	(r221669)
@@ -0,0 +1,8 @@
+# $FreeBSD$
+
+# some sort of 't' outside BMP
+s=$s$(printf '\360\235\225\245')
+
+# Start a new shell so the locale change is picked up.
+ss="$(LC_ALL=en_US.UTF-8 ${SH} -c "printf %s \$'\U0001d565'")"
+[ "$s" = "$ss" ]
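
For readers who want to try the byte arithmetic outside of sh, the manual UTF-8 encoding in the parser.c hunk can be sketched as a standalone function. This is only an illustration and not part of the commit: the name utf8_encode, the caller-supplied buffer, and the return-length convention are assumptions made here. The commit itself writes bytes through sh's USTPUTC macro and reports invalid code points through synerror().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the commit): encode code point v as UTF-8
 * into buf, which must have room for 4 bytes.  Returns the number of
 * bytes written, or 0 for values the commit rejects (code point 0 and
 * UTF-16 surrogates) and for values above U+10FFFF.
 */
size_t
utf8_encode(uint32_t v, unsigned char *buf)
{
	assert(buf != NULL);

	if (v == 0 || (v >= 0xd800 && v <= 0xdfff) || v > 0x10ffff)
		return 0;
	if (v <= 0x7f) {
		/* ASCII: one byte, stored as is. */
		buf[0] = (unsigned char)v;
		return 1;
	}
	if (v <= 0x7ff) {
		/* Two bytes: 110xxxxx 10xxxxxx. */
		buf[0] = (unsigned char)(0xc0 | (v >> 6));
		buf[1] = (unsigned char)(0x80 | (v & 0x3f));
		return 2;
	}
	if (v <= 0xffff) {
		/* Three bytes: 1110xxxx 10xxxxxx 10xxxxxx. */
		buf[0] = (unsigned char)(0xe0 | (v >> 12));
		buf[1] = (unsigned char)(0x80 | ((v >> 6) & 0x3f));
		buf[2] = (unsigned char)(0x80 | (v & 0x3f));
		return 3;
	}
	/* Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. */
	buf[0] = (unsigned char)(0xf0 | (v >> 18));
	buf[1] = (unsigned char)(0x80 | ((v >> 12) & 0x3f));
	buf[2] = (unsigned char)(0x80 | ((v >> 6) & 0x3f));
	buf[3] = (unsigned char)(0x80 | (v & 0x3f));
	return 4;
}
```

Passing 0xe4, 0x20ac and 0x1d565 yields the byte sequences that the two regression tests spell in octal: \303\244, \342\202\254 and \360\235\225\245.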