Date:      Mon, 12 Jul 2004 07:35:59 +0000 (UTC)
From:      "Tim J. Robbins" <tjr@FreeBSD.org>
To:        src-committers@FreeBSD.org, cvs-src@FreeBSD.org, cvs-all@FreeBSD.org
Subject:   cvs commit: src/lib/libc/regex engine.c regcomp.c regex2.h regexec.c regfree.c
Message-ID:  <200407120735.i6C7Zx2f005903@repoman.freebsd.org>

tjr         2004-07-12 07:35:59 UTC

  FreeBSD src repository

  Modified files:
    lib/libc/regex       engine.c regcomp.c regex2.h regexec.c 
                         regfree.c 
  Log:
  Make regular expression matching aware of multibyte characters. The general
  idea is that the pattern is converted from multibyte to wide characters while
  it is parsed and compiled; during execution, byte sequences in the input
  string are converted to wide characters only as they are needed for
  comparison and for stepping through the string.
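
  To illustrate the execution-time half of that idea, here is a minimal
  sketch of stepping through a multibyte string one character at a time with
  the standard mbrtowc(3), comparing each decoded wide character as it goes.
  The helper name find_wide_char is hypothetical and is not taken from the
  committed code; the real engine is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (not from the commit): return the byte offset of the
 * first character in s[0..len) that equals `want`, or -1 if there is none
 * or the input is not valid in the current locale's encoding.
 */
long
find_wide_char(const char *s, size_t len, wchar_t want)
{
    mbstate_t mbs;
    size_t off = 0;

    assert(s != NULL);
    memset(&mbs, 0, sizeof(mbs));   /* start in the initial conversion state */
    while (off < len) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s + off, len - off, &mbs);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (-1);            /* invalid or truncated sequence */
        if (n == 0)
            n = 1;                  /* an embedded NUL occupies one byte */
        if (wc == want)
            return ((long)off);     /* compare on the decoded character */
        off += n;                   /* step by whole characters, not bytes */
    }
    return (-1);
}
```

  For example, find_wide_char("hello", 5, L'l') yields 2. In a multibyte
  locale the returned value is still a byte offset, but the loop only ever
  stops on character boundaries.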
  
  As with tr(1), the main complication is representing sets of characters in
  bracket expressions efficiently. The old bitmap representation is replaced
  by a bitmap for the first 256 characters combined with a vector of individual
  wide characters, a vector of character ranges (for [A-Z] etc.), and a vector
  of character classes (for [[:alpha:]] etc.).
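
  The four-part representation described above could be sketched roughly as
  follows. Every name here (charset_sketch, wrange, charset_contains and so
  on) is hypothetical and not copied from regex2.h. The sketch also assumes
  that wide character values are Unicode code points and that whoever builds
  the set has already folded any low members of ranges and classes into the
  bitmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* All names here are hypothetical; they are not taken from regex2.h. */
struct wrange {
    wint_t min, max;                /* inclusive bounds, e.g. 'A'..'Z' */
};

struct charset_sketch {
    unsigned char bmp[256 / 8];     /* one bit per character below 256 */
    const wint_t *wides;            /* individual characters >= 256 */
    size_t nwides;
    const struct wrange *ranges;    /* character ranges >= 256 */
    size_t nranges;
    const wctype_t *types;          /* classes, e.g. wctype("alpha") */
    size_t ntypes;
};

/* Add a character below 256 to the bitmap part of the set. */
void
charset_add_low(struct charset_sketch *cs, unsigned char c)
{
    cs->bmp[c >> 3] |= (unsigned char)(1u << (c & 7));
}

/* Membership test: bitmap fast path first, then the three vectors. */
bool
charset_contains(const struct charset_sketch *cs, wint_t ch)
{
    size_t i;

    assert(cs != NULL);
    if ((unsigned long)ch < 256)    /* assumption: bitmap is complete here */
        return ((cs->bmp[ch >> 3] >> (ch & 7)) & 1);
    for (i = 0; i < cs->nwides; i++)
        if (cs->wides[i] == ch)
            return (true);
    for (i = 0; i < cs->nranges; i++)
        if (cs->ranges[i].min <= ch && ch <= cs->ranges[i].max)
            return (true);
    for (i = 0; i < cs->ntypes; i++)
        if (iswctype(ch, cs->types[i]))
            return (true);
    return (false);
}

/* Usage: a set holding 'a', U+20AC, and the range U+0410..U+044F. */
bool
demo_contains(wint_t ch)
{
    static const wint_t wides[] = { 0x20AC };
    static const struct wrange ranges[] = { { 0x0410, 0x044F } };
    struct charset_sketch cs = { { 0 }, wides, 1, ranges, 1, NULL, 0 };

    charset_add_low(&cs, 'a');
    return (charset_contains(&cs, ch));
}
```

  The point of the split is that the common case, a character below 256,
  costs one bit test, while large ranges and classes are stored compactly
  instead of as a bitmap covering the whole wide character space.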
  
  One other point of interest is that although the Boyer-Moore algorithm had
  to be disabled in the general multibyte case, it is still enabled for UTF-8
  because of its self-synchronizing nature. This greatly speeds up matching
  by reducing the number of multibyte conversions that need to be done.
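
  The self-synchronizing property can be shown with a small sketch. In
  UTF-8, continuation bytes always have the form 10xxxxxx and lead bytes
  never do, so a byte-wise search for a valid UTF-8 needle in valid UTF-8
  text can only match at a character boundary. The example below uses plain
  strstr(3) as a stand-in for the Boyer-Moore search; the helper names
  utf8_is_boundary and utf8_find are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if byte b can start a UTF-8 character (it is not 10xxxxxx). */
bool
utf8_is_boundary(unsigned char b)
{
    return ((b & 0xC0) != 0x80);
}

/*
 * Byte-wise substring search with no multibyte decoding at all.
 * Assumes both strings are valid UTF-8. Returns the byte offset of the
 * first match, or -1 if the needle does not occur.
 */
long
utf8_find(const char *hay, const char *needle)
{
    const char *hit;

    assert(hay != NULL && needle != NULL);
    hit = strstr(hay, needle);
    if (hit == NULL)
        return (-1);
    /*
     * A valid needle starts with a non-continuation byte, so the match
     * cannot begin in the middle of a character in the haystack.
     */
    return ((long)(hit - hay));
}
```

  With the haystack "na\xC3\xAFve caf\xC3\xA9" (the bytes of "naïve café"),
  searching for "caf\xC3\xA9" returns byte offset 7, and the byte at that
  offset satisfies utf8_is_boundary. Encodings without this property can
  produce false matches that straddle two characters, which is why the
  general multibyte case cannot use the same shortcut.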
  
  Revision  Changes    Path
  1.14      +92 -40    src/lib/libc/regex/engine.c
  1.32      +253 -259  src/lib/libc/regex/regcomp.c
  1.8       +57 -17    src/lib/libc/regex/regex2.h
  1.6       +64 -3     src/lib/libc/regex/regexec.c
  1.6       +10 -3     src/lib/libc/regex/regfree.c
