Date:      Tue, 28 Dec 2010 17:57:52 GMT
From:      Mathieu <sigsys@gmail.com>
To:        freebsd-gnats-submit@FreeBSD.org
Subject:   bin/153502: regex(3) bug with UTF-8 locale
Message-ID:  <201012281757.oBSHvqcr022002@red.freebsd.org>
Resent-Message-ID: <201012281800.oBSI0WhM080468@freefall.freebsd.org>


>Number:         153502
>Category:       bin
>Synopsis:       regex(3) bug with UTF-8 locale
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Tue Dec 28 18:00:32 UTC 2010
>Closed-Date:
>Last-Modified:
>Originator:     Mathieu
>Release:        8.1-STABLE, 7.3-RELEASE-p3
>Organization:
>Environment:
8.1-STABLE/amd64 r212312M
7.3-RELEASE-p3/i386 r215233M

>Description:
I'm seeing odd behavior from programs that use regex(3), such as less(1), vi(1) and sed(1), when LANG=en_US.UTF-8 is set and the input is UTF-8.

Sometimes it seems to work right:

$ echo 'é' | sed -ne '/^.$/p'
é
$ echo 'éé' | sed -ne '/^..$/p'
éé
$ echo 'aéa' | sed -ne '/a.a/p'
aéa
$ echo 'aéa' | sed -ne '/a.*a/p'
aéa
$ echo 'aaéaa' | sed -ne '/aa.aa/p'
aaéaa
$ echo 'aéaéa' | sed -ne '/a.a.a/p'
aéaéa

But not always. Each of these should print its input line, but prints nothing:

$ echo 'éa' | sed -ne '/.a/p'
$ echo 'aéaa' | sed -ne '/a.aa/p'
$ echo 'éaé' | sed -ne '/.a./p'
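
The cases above can be collected into a small script, which may be handy for re-checking after a fix. This is only a sketch: "check" is a hypothetical helper name, and the script assumes that the en_US.UTF-8 locale is installed and that the script file itself is saved as UTF-8.

```shell
#!/bin/sh
# check PATTERN INPUT
# Hypothetical helper: feeds INPUT to sed and reports whether /PATTERN/
# selected the line. LC_ALL is set for the sed process only, so the result
# does not depend on the caller's environment (assumes en_US.UTF-8 exists).
check() {
    out=$(printf '%s\n' "$2" | LC_ALL=en_US.UTF-8 sed -n "/$1/p")
    if [ "$out" = "$2" ]; then
        echo "PASS  /$1/  $2"
    else
        echo "FAIL  /$1/  $2"
        return 1
    fi
}

failures=0

# Cases reported as matching correctly.
check '^.$'   'é'     || failures=$((failures + 1))
check '^..$'  'éé'    || failures=$((failures + 1))
check 'a.a'   'aéa'   || failures=$((failures + 1))
check 'a.*a'  'aéa'   || failures=$((failures + 1))
check 'aa.aa' 'aaéaa' || failures=$((failures + 1))
check 'a.a.a' 'aéaéa' || failures=$((failures + 1))

# Cases reported as printing nothing although they should match.
check '.a'    'éa'    || failures=$((failures + 1))
check 'a.aa'  'aéaa'  || failures=$((failures + 1))
check '.a.'   'éaé'   || failures=$((failures + 1))

echo "$failures case(s) failed"
```

On a system without the problem, every line should read PASS. Which lines read FAIL on an affected system shows which patterns are hit.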


It seems that the unbounded repetitions ".*", ".+", ".{0,}" and ".{1,}" work correctly, but the bounded forms ".{0,1}" and ".{1,1}", and a lone ".", do not always match.
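
One way to compare these quantifier forms side by side is sketched below. The report does not give the exact patterns used for this comparison, so the anchored patterns here are a guess; "try_ere" is a hypothetical helper name. The sketch also assumes a sed that accepts -E for extended regular expressions, and that the en_US.UTF-8 locale is installed.

```shell
#!/bin/sh
# try_ere PATTERN INPUT
# Hypothetical helper: prints "match" if sed -E selects INPUT with
# /PATTERN/, otherwise "no match". LC_ALL is set for sed only.
try_ere() {
    out=$(printf '%s\n' "$2" | LC_ALL=en_US.UTF-8 sed -E -n "/$1/p")
    if [ -n "$out" ]; then
        echo "match"
    else
        echo "no match"
    fi
}

# Assumed test patterns: each one needs "." to consume the whole
# two-byte "é" before the trailing "a". The anchors keep the bounded
# forms from matching trivially with zero repetitions.
for re in '^.*a$' '^.+a$' '^.{0,}a$' '^.{1,}a$' \
          '^.{0,1}a$' '^.{1,1}a$' '^.a$'; do
    printf '%-12s %s\n' "$re" "$(try_ere "$re" 'éa')"
done
```

If the observation above holds, the first four patterns would be expected to report a match and the last three might not.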

>How-To-Repeat:

>Fix:


>Release-Note:
>Audit-Trail:
>Unformatted:


