Date:      Sat, 02 Aug 2014 05:22:38 +0400
From:      Dmitry Selyutin <ghostman.sd@gmail.com>
To:        Pedro Giffuni <pfg@FreeBSD.org>, David Chisnall <theraven@freebsd.org>, soc-status@FreeBSD.org
Subject:   Report #5: Unicode support
Message-ID:  <53DC3D5E.5080909@gmail.com>
In-Reply-To: <53DC3C41.7070105@gmail.com>
References:  <53DC3C41.7070105@gmail.com>

Sorry, I forgot to change the subject line according to the rules.
I'm sending this message again so that anyone can find it more easily.
Sorry for being annoying.


Hello everyone!

Here is my report on the progress achieved during this period. I've
implemented the actual Unicode Collation Algorithm for DUCET (the Default
Unicode Collation Element Table). I had to rewrite the entire
implementation: I wasn't satisfied with its quality or with the way
I had organized my source code, so I reverted my code and started again.
My previous implementation was full of hard-coded parts, which made it
hard to reuse anything from it in any other project. Now the
entire implementation is available in include/unicode.h and
lib/libc/unicode. If the macro _UNICODE_SOURCE is defined, wcscoll()
uses the new collation algorithm. struct _xlocale was modified to
carry two new members, colltable and collsize, which are simply
passed to __ucscoll(). If an element is not found in the given table,
or the table is NULL, __ucscoll() tries to find the element in DUCET;
if it is not found there either, __ucscoll() generates the collation.
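
A minimal sketch of the lookup order described above (given table first,
then DUCET, then a generated collation). All names here (coll_elem,
coll_entry, toy_ducet, table_find, coll_lookup) are hypothetical and are
not the actual code in lib/libc/unicode; the table weights are made up,
and the generated weights follow the implicit-weight formula from UTS #10
using only the 0xFBC0 base, which is a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: one collation element with three weight levels. */
struct coll_elem {
	uint16_t weight[3];	/* primary, secondary, tertiary */
};

/* Hypothetical table entry; tables are assumed sorted by code point. */
struct coll_entry {
	uint32_t code;
	struct coll_elem elem;
};

/* Stand-in for DUCET with made-up weights (NOT real DUCET values). */
static const struct coll_entry toy_ducet[] = {
	{ 0x0061, { { 0x1000, 0x0020, 0x0002 } } },
	{ 0x0062, { { 0x1001, 0x0020, 0x0002 } } },
};
#define TOY_DUCET_SIZE (sizeof(toy_ducet) / sizeof(toy_ducet[0]))

/* Binary search; returns NULL if the table is NULL or has no entry. */
static const struct coll_entry *
table_find(const struct coll_entry *table, size_t size, uint32_t code)
{
	size_t lo = 0, hi = size;

	if (table == NULL)
		return (NULL);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (table[mid].code == code)
			return (&table[mid]);
		if (table[mid].code < code)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (NULL);
}

/* Stores 1 or 2 collation elements in out and returns how many. */
static size_t
coll_lookup(const struct coll_entry *colltable, size_t collsize,
    uint32_t code, struct coll_elem out[2])
{
	const struct coll_entry *e;

	assert(out != NULL);
	e = table_find(colltable, collsize, code);	/* given table */
	if (e == NULL)
		e = table_find(toy_ducet, TOY_DUCET_SIZE, code); /* DUCET */
	if (e != NULL) {
		out[0] = e->elem;
		return (1);
	}
	/* Not found anywhere: generate implicit weights (UTS #10). */
	out[0].weight[0] = (uint16_t)(0xFBC0 + (code >> 15));
	out[0].weight[1] = 0x0020;
	out[0].weight[2] = 0x0002;
	out[1].weight[0] = (uint16_t)((code & 0x7FFF) | 0x8000);
	out[1].weight[1] = 0;
	out[1].weight[2] = 0;
	return (2);
}
```

In this sketch an entry in the given table shadows the DUCET entry for
the same code point, and a NULL table simply falls through to DUCET.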

I couldn't understand how the alternate should be used, though; it seems
that it can be dropped, since wcscoll() doesn't have any version that
supports tailoring. I've left it in for now, but I'm pretty sure that we
can omit it.

I haven't had time to test wcscoll() more thoroughly (especially with the
files provided by the Unicode Character Database), so that is the task I
will do right now. :-) There are still several ways to improve the speed
of the algorithm, but I feel that the time for that hasn't come yet.
style(9) issues, if any, will also be handled; I'm just too tired to do
it right now.

__ucscoll() just calls __ucsxfrm() and then compares the resulting
strings using wcscmp() (this is the only platform-dependent part of the
code; I was too lazy to write __ucslen(), so I left it as it is). This
collation algorithm supports three levels; the last one, IIRC, is usually
the character itself if not defined, so I decided to omit it (especially
since I'm not sure how variable weights should be handled). Any thoughts?
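
To illustrate the "transform, then wcscmp()" approach mentioned above,
here is a rough sketch under stated assumptions. The names toy_weight,
sketch_xfrm and sketch_coll are hypothetical and are not the real
__ucsxfrm()/__ucscoll(); the weights are toy values for ASCII letters
only, and the level separator is 0x0001 rather than 0x0000 so that
wcscmp() does not stop at it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

#define LEVELS		3
#define LEVEL_SEP	((wchar_t)0x0001)	/* below every toy weight */

/* Toy weights (assumption): letters only, everything else is ignored. */
static wchar_t
toy_weight(wchar_t c, int level)
{
	int upper = (c >= L'A' && c <= L'Z');
	int lower = (c >= L'a' && c <= L'z');

	if (!upper && !lower)
		return (0);
	switch (level) {
	case 0:		/* primary: the letter, case-insensitive */
		return ((wchar_t)(0x1000 + (upper ? c - L'A' : c - L'a')));
	case 1:		/* secondary: same for all letters */
		return (0x0020);
	default:	/* tertiary: lowercase sorts before uppercase */
		return (upper ? 0x0008 : 0x0002);
	}
}

/*
 * wcsxfrm()-style interface: writes at most n wide characters (including
 * the terminator) and returns the key length without the terminator.
 */
static size_t
sketch_xfrm(wchar_t *dst, const wchar_t *src, size_t n)
{
	size_t len = 0;

	for (int level = 0; level < LEVELS; level++) {
		if (level > 0) {
			if (len + 1 < n)
				dst[len] = LEVEL_SEP;
			len++;
		}
		for (const wchar_t *p = src; *p != L'\0'; p++) {
			wchar_t w = toy_weight(*p, level);

			if (w == 0)	/* zero weights are skipped */
				continue;
			if (len + 1 < n)
				dst[len] = w;
			len++;
		}
	}
	if (n > 0)
		dst[len < n ? len : n - 1] = L'\0';
	return (len);
}

/* Build both sort keys, then let wcscmp() decide the order. */
static int
sketch_coll(const wchar_t *a, const wchar_t *b)
{
	size_t la = sketch_xfrm(NULL, a, 0) + 1;
	size_t lb = sketch_xfrm(NULL, b, 0) + 1;
	wchar_t *ka = malloc(la * sizeof(*ka));
	wchar_t *kb = malloc(lb * sizeof(*kb));
	int r;

	assert(ka != NULL && kb != NULL);
	sketch_xfrm(ka, a, la);
	sketch_xfrm(kb, b, lb);
	r = wcscmp(ka, kb);
	free(ka);
	free(kb);
	return (r);
}
```

With these toy weights, "a" sorts before "B" because the primary level is
compared first, and "a" sorts before "A" only at the tertiary level.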

-- 
With best regards,
Dmitry Selyutin



