Date: Tue, 28 Dec 2010 17:57:52 GMT
From: Mathieu <sigsys@gmail.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: bin/153502: regex(3) bug with UTF-8 locale
Message-ID: <201012281757.oBSHvqcr022002@red.freebsd.org>
Resent-Message-ID: <201012281800.oBSI0WhM080468@freefall.freebsd.org>

>Number:         153502
>Category:       bin
>Synopsis:       regex(3) bug with UTF-8 locale
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Tue Dec 28 18:00:32 UTC 2010
>Closed-Date:
>Last-Modified:
>Originator:     Mathieu
>Release:        8.1-STABLE, 7.3-RELEASE-p3
>Organization:
>Environment:
8.1-STABLE/amd64 r212312M
7.3-RELEASE-p3/i386 r215233M
>Description:
I'm seeing odd behavior from programs using regex(3) like less(1), vi(1)
and sed(1) when using LANG=en_US.UTF-8 and UTF-8 inputs.

Sometimes it seems to work right:

$ echo 'é' | sed -ne '/^.$/p'
é
$ echo 'éé' | sed -ne '/^..$/p'
éé
$ echo 'aéa' | sed -ne '/a.a/p'
aéa
$ echo 'aéa' | sed -ne '/a.*a/p'
aéa
$ echo 'aaéaa' | sed -ne '/aa.aa/p'
aaéaa
$ echo 'aéaéa' | sed -ne '/a.a.a/p'
aéaéa

But not always:

$ echo 'éa' | sed -ne '/.a/p'
$ echo 'aéaa' | sed -ne '/a.aa/p'
$ echo 'éaé' | sed -ne '/.a./p'

Seems like using ".*", ".+", ".{0,}" or ".{1,}" works right, but
".{0,1}", ".{1,1}" or a lone "." doesn't always.
>How-To-Repeat:
>Fix:
>Release-Note:
>Audit-Trail:
>Unformatted:
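
The failing cases from the description can be rerun in one go with a small script. This is only a sketch: the `check` helper is hypothetical (it is not part of the report), and it assumes the en_US.UTF-8 locale is installed and that the script file itself is saved as UTF-8.

```shell
#!/bin/sh
# Hypothetical helper, not from the report: prints "match" if sed
# selects the input line under a UTF-8 locale, "no match" otherwise.
check() {
    # $1 = input line, $2 = basic regular expression (no '/' inside)
    if [ -n "$(printf '%s\n' "$1" | LC_ALL=en_US.UTF-8 sed -ne "/$2/p")" ]; then
        echo "match"
    else
        echo "no match"
    fi
}

# The three cases reported as failing; each entry is "input:pattern".
for pair in 'éa:.a' 'aéaa:a.aa' 'éaé:.a.'; do
    input=${pair%%:*}
    pattern=${pair#*:}
    printf '%s  /%s/  %s\n' "$input" "$pattern" "$(check "$input" "$pattern")"
done
```

A regex(3) implementation that handles multibyte characters correctly would be expected to report a match for all three lines, so any "no match" in the output points at the behavior described in the report.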