Date:      Sun, 26 Jun 2016 16:34:11 +0200
From:      Polytropon <freebsd@edvax.de>
To:        Daniël de Kok <me@danieldk.eu>
Cc:        freebsd-questions@freebsd.org
Subject:   Re: grep and anchoring
Message-ID:  <20160626163411.d05f863e.freebsd@edvax.de>
In-Reply-To: <20232C89-B821-41EC-9188-C2A19C679BD8@danieldk.eu>
References:  <20232C89-B821-41EC-9188-C2A19C679BD8@danieldk.eu>

On Sun, 26 Jun 2016 15:10:57 +0200, Daniël de Kok wrote:
> Dear all,
> 
> After a BSD hiatus of many years, I am tinkering with FreeBSD again.
> I’ve run into some strange issue with grep and beginning of line (^)
> anchoring:
> 
> —
> % echo "1234 1234 1234" | egrep -o '^….'
> 1234
>  123
> 4 12
> % echo "123412341234" | egrep -o '^....'
> 1234
> 1234
> 1234
> —
> 
> Any idea what is going on here?

I think what you see here is a typical "UTF-8 fsck-up".
The first search pattern contains an ellipsis ("…",
3 bytes long in UTF-8, a single character that stands in
for three dots), and a single dot (".", 1 byte long,
1 character); the second pattern contains four dots
(4 x ".", each 1 byte long, 1 character).
Of course grep interprets "…" and "..." differently.
In my mailer, I can see the difference clearly as the
ellipsis … is displayed in monospace font as a _one_
character wide symbol on the screen.
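
For anyone who wants to check the sizes, here is a minimal
sketch. Python is my own choice here, it is not part of the
original exchange, and the helper name sizes() is made up for
illustration. It compares the character count and the UTF-8
byte count of the ellipsis and of three plain dots:

```python
# Compare character count and UTF-8 byte count of the two spellings.
ellipsis = "\u2026"   # HORIZONTAL ELLIPSIS, the single "…" character
three_dots = "..."    # three ASCII full stops


def sizes(text):
    """Return (characters, UTF-8 bytes) for text (hypothetical helper)."""
    return len(text), len(text.encode("utf-8"))


print(sizes(ellipsis))    # (1, 3)
print(sizes(three_dots))  # (3, 3)
```

Both spellings happen to take the same number of bytes, so
a byte count alone will not tell them apart; the character
count does.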

Or is this just an "enrichment" your MUA added? :-)

I'm quite sure you will run into similar problems when you
include ligatures (like st, ft, ffi, ck or the like)
or one of the many different hyphens and spaces in a
search pattern. :-)

Otherwise, your example seems to show the expected
behaviour.

	% echo "1234 1234 1234" | egrep -o '^....'
	1234
	 123
	4 12

	% echo "123412341234" | egrep -o '^....'
	1234
	1234
	1234

The first 4-character match is "1234", the next is " 123",
and the last is "4 12" (each 4 characters wide, as the
space character " " is also "any character" and so matches
the . pattern). In the second example, each match is
4 characters long ("1234" x 3).
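
The grouping can be reproduced outside of grep with a small
sketch. This uses Python's re module, which is an assumption
on my side and not the tool from the thread, and it leaves
out the ^ anchor on purpose: it only shows that "." also
matches the space and how the 4-character groups fall. It
says nothing about how grep -o treats an anchored pattern.

```python
import re

# Four "any character" dots, without the ^ anchor; findall returns
# successive non-overlapping matches, scanning left to right.
four_any = re.compile(r"....")

print(four_any.findall("1234 1234 1234"))  # ['1234', ' 123', '4 12']
print(four_any.findall("123412341234"))    # ['1234', '1234', '1234']
```

In the first string the trailing "34" is left over, because
only two characters remain and the pattern needs four.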

What different results did you expect? Or am I misinterpreting
your question?


-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...


