Date:      Wed, 18 Oct 1995 19:20:55 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        kaleb@x.org (Kaleb S. KEITHLEY)
Cc:        terry@lambert.org, hackers@freefall.FreeBSD.org
Subject:   Re: xterm dumps core
Message-ID:  <199510190220.TAA01615@phaeton.artisoft.com>
In-Reply-To: <199510190052.UAA00286@exalt.x.org> from "Kaleb S. KEITHLEY" at Oct 18, 95 08:52:36 pm

> > The main issue here is whether a single "Unicode font" is possible or not.
> 
> Possible or practical given the current technology base?

Both, since I have a 14 point font that is complete except for
ligatured languages, and there is a company in Taiwan that has
a uxterm and a *bunch* of fonts along the same lines, plus they
do the dancing to support the ligatured stuff.

> > For non-ligatured languages (ie: not Arabic, Hebrew, Tamil, Devanagari,
> > or English script ["Cursive"]), the answer is "yes, it is possible,
> > as long as we are talking about internationalization (enabling for
> > soft localization to a single locale) ...
> 
> Possible or practical given the current technology base?

Both.  Less than 1M of ROM for inclusion in an xterminal.  ROM is cheap.

> > ...instead of multinationalization
> > (ability to simultaneously support multiple glyphs for a single
> > character encoding, ie: the Han Unifications, etc.).
> 
> You're talking about a stateful encoding, where the glyph for a particular
> character is dependent on preceding or succeeding characters; conceptually 
> similar to using compose sequences to enter characters in the right half 
> of Latin-1 using a QWERTY keyboard, although that's probably an over-
> simplification.

Not really.  Unicode's name for mixing multiple simultaneous languages
drawn from unshared round-trip character set standards is "Rich Text
Format".  The rest of the world would call this a "Compound Document".

For ligatured languages, the glyph selection can be on the basis of
the ordinal in the round trip set being multiplied by some constant
and then added to a constant offset to get the encoding area in
Unicode.  The selection of the actual glyph to be displayed from the
set [0..(constant-1)] is done by dividing the constant into 4 zones:
(i) space before, (ii) space after, (iii) no space (entry), and
(iv) no space (exit).  This allows up to (constant/4)
ligature joining points from character to character.  Then drawing
is done on the basis of precomputation of adjacency in the application
text draw library.
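
A minimal sketch of that glyph selection in C, assuming a constant of
16 variants per character and a private-use base area; the numbers and
names are illustrative, not from any standard:

    #include <stdio.h>

    #define VARIANTS 16        /* glyph variants per character (assumed)  */
    #define BASE     0xE000    /* private-use base for the encoding area  */

    /* adjacency zones: which quarter of the variant range to draw from */
    enum zone { SPACE_BEFORE, SPACE_AFTER, NO_SPACE_ENTRY, NO_SPACE_EXIT };

    /*
     * Map an ordinal from the round-trip set, plus the zone computed by
     * the text draw library, to a glyph index; "join" picks one of the
     * (VARIANTS/4) joining points inside the zone.
     */
    static unsigned
    glyph_index(unsigned ordinal, enum zone z, unsigned join)
    {
            unsigned area = BASE + ordinal * VARIANTS;  /* encoding area   */
            unsigned per_zone = VARIANTS / 4;           /* glyphs per zone */

            return area + z * per_zone + (join % per_zone);
    }

    int
    main(void)
    {
            /* third character, no-space (entry) form, join point 1 */
            printf("glyph 0x%04X\n", glyph_index(3, NO_SPACE_ENTRY, 1));
            return 0;
    }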

Expensive, but less expensive than ruling out X fonts as a rendering
technology and computing the ligatures at run time into a bitmap and
copying it down.

> > For ligatured languages, it's possible to either adopt a locale
> > recognized block print font (Hebrew has one), or redefine the
> > areas where the ligatured fonts lie as "private use" areas (in tacit
> > violation of the standard), and respecify character encodings and
> > round-trip tables for those languages.
> 
> I believe we have stateful encodings right now. You seem to be saying
> that stateful encodings aren't possible in Unicode.

They aren't.  You must use a "Rich Text Format", and then choose to
do the conversion from the ordinal value into a character/font
selector from that information, or you must allow them in Unicode
by wedging them in -- though the only languages that would require
this are those that require ligaturing.

For the previous zone example, a 16:1 compatibility would provide a
2 pixel variance on an 8 point font: sufficient to produce reasonable
text from the data without font switching, only zone/boundary
comparisons in selecting which of the 16 glyphs for each character
should be used to line up with the ligature point on the next
character.

> > Keyboard input methodology is an interpretational issue, and is only
> > loosely bound to the fact that X (improperly) assigns keycode values
> > based on internal knowledge of keycap legends.  This is loosely bound
> > because of the ability to symbolically rebind these values with single
>                  ^^^^^^^
> ??? The ability or the inability to rebind values?

Ability.  If the keycodes weren't run through a translation prior to
use (ie: the program didn't follow the rules on allowing key bindings),
then it would be tightly bound.
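
As a sketch of what "run through a translation" amounts to -- a single
forward table per keyboard/locale, with made-up contents:

    #include <stdio.h>

    #define NKEYCODES 256

    /*
     * One forward table per keyboard/locale: raw keycode in, symbolic
     * value out.  Rebinding for a different keycap legend means loading
     * a different table; programs that go through the table stay only
     * loosely bound to the hardware.  (The two entries are made up.)
     */
    static unsigned short keysym_of[NKEYCODES] = {
            [38] = 'a',
            [24] = 'q',
    };

    int
    main(void)
    {
            unsigned char raw = 38;        /* raw keycode from the server */

            printf("keycode %u -> keysym 0x%x\n", raw, keysym_of[raw]);
            return 0;
    }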

> > forward table references.
> 
> > The "support for locale-based characater set designation" argues on the
> > basis of a choice of a character set that is a subset of Unicode, or
> > is an artifact of coding technique (ie: "xtamil").
> > 
> > I believe this to be a largely specious argument.
> 
> I don't follow you. I'm confident that when vendors start supplying a
> Unicode locale, that the X locale mechanism is extensible and flexible
> enough to follow suit.

The problem is that Unicode is a character set standard, not a glyph
encoding standard... at least that is the intent.  That means that it's
an OK process encoding standard and an OK storage encoding standard,
but that you are expected to either make a round trip conversion to
another character set to do the display, or you are expected to implement
a subset font that is representative of the code points in a round trip
standard that are in Unicode, and no other characters in Unicode.
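
For the "round trip conversion to another character set to do the
display" case, a minimal sketch; the two-entry KOI8-R fragment is
purely an example, a real table covers the whole round-trip standard:

    #include <stdio.h>

    struct rtmap {
            unsigned short  ucs;    /* Unicode code point          */
            unsigned char   local;  /* byte in the display charset */
    };

    /* round-trip fragment: Unicode process encoding back to KOI8-R */
    static struct rtmap koi8r[] = {
            { 0x0410, 0xE1 },       /* CYRILLIC CAPITAL LETTER A   */
            { 0x0411, 0xE2 },       /* CYRILLIC CAPITAL LETTER BE  */
            { 0x0000, 0x00 }
    };

    static int
    to_display(unsigned short ucs)
    {
            struct rtmap *m;

            for (m = koi8r; m->ucs != 0; m++)
                    if (m->ucs == ucs)
                            return m->local;
            return -1;              /* not representable: font switch needed */
    }

    int
    main(void)
    {
            printf("U+0410 -> 0x%02X in KOI8-R\n", to_display(0x0410));
            return 0;
    }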

Consider a document in Japanese that describes "how to read Chinese";
the characters in the description and in the described text could very
well have been unified to single code points, and thus the attribution
of font is outside the scope of Unicode itself (there exists no standard
character set that contains both Chinese and Japanese as separate
entities).


Pragmatically, it would be useful to be able to display all characters
in a file as if they were Unicode characters, and not be so anal about
the actual locale they were generated in.  That is, you would display
Chinese characters as if they were Japanese, etc. as a side effect of
the CJK Unification.  This is especially useful if you did an "ls -R"
on a data vault or some other shared storage system.

So while there is no such thing as a "Unicode" locale, nor can there be,
it would be useful to pretend that one existed for purely pragmatic
reasons.

If we get into font switching, then there is little difference between
selecting glyphs by ligatured character adjacency vs. selecting them
by "Rich Text" desination of the real locale for the characters; at
the point you start switching fonts, you've already lost what little
margin you had available to you to "pretend" there was a single locale,
and you start dragging in abstractions built on "FontSet" selection
into each and every program that expects to be able to display data.

> > What the ANSI/POSIX/ISO standards *do* lack is the ability to specify
> > locale-based input methods for distinct sub-character set based locales
> > as part of the locale information.
> 
> Do you mean e.g. the ability to switch to an alternative character set/
> encoding such as Arabic in a Latin-1 locale?

No.  The character set for an input device is never Unicode; humans
haven't unified all keyboard input mechanisms for all human languages,
largely because the keyboards have keycap legends and so it is not
really possible to do it cleanly.  I suppose you could look at it as
picking a character set which you will then round-trip to Unicode
process encoding by virtue of the device you have chosen to use for
your input.

The closest you could get to actual switchable attribution would be
for a multilingual touch-typist for keyboards with identical form
factors and no keycap markings, or little display panels on each key
to allow variation in keycap from moment to moment (I claim prior art
on this if anyone goes off and implements one 8-)).

Since typical usage of a computer occurs in the common character set
for a given locale (indeed, there is currently no other choice because
of conflicts in round-tripping standards), the expense of trying to
do something like that would outweigh the returns.

The "Arabic"/"Latin-1" example you give is actually a conflict case,
since the Arabic character set standard is a superset of ASCII, as is
Latin-1, but the standards are intersecting (except under Unicode
unification), and so no keyboard exists for handling both.

Latin-1 is maybe a bad technical example, since one could see using
NRCS type character replacement or DEC-style character composition to
either use a 7 bit ASCII variant in place of the ASCII in the Arabic
character set, or using an already familiar chording that would be
applicable on an Arabic/QWERTY dual function keyboard (though from
my experience there would be conflict over the ALT-key usage, I
think it would be resolvable, either by dead-key or secondary control
key redesignation).

A better example of an "impossible" switch would be Cyrillic/Greek or
some other combination (Japanese/Chinese), where the "unified" keycap
legends would be too much.

> > This (and runic encoding at all) is why I believe that XPG/4 is itself
> > bogus, though it is quite arguable that locale specificity of input
> > is a problem entirely addressable by hardware alone.
> > 
> > Note that input method *could* be specified by locale specific hardware,
> > as long as one was not interested in multinationalization and/or various
> > multilingual applications without a single round-trip standard for use
> > in conversion to/from Unicode.
> 
> You lost me again.

You get a keyboard specific to your locale's character set standard, and
since you can only physically be in one locale at a time, unless you
attempt multilingual editing with conflicting standards, you can choose
to use the "right" keyboard.

Such a keyboard could generate Unicode 16 bit characters and up/down and
modifier encodings, such that the keyboard would be what you replaced
when switching locales.  That would save you all of the software crap
you have to go through matching identical keycodes to locale-based keycap
legends.
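
A sketch of the event such a keyboard might emit -- entirely
hypothetical hardware, and the field layout is my own invention:

    #include <stdio.h>

    /*
     * Hypothetical event from a Unicode-native keyboard: a 16-bit
     * character plus up/down and modifier state, so the host never has
     * to match keycodes against locale-based keycap legends.
     */
    struct ukbd_event {
            unsigned short  ucs;        /* Unicode character          */
            unsigned char   down;       /* 1 = press, 0 = release     */
            unsigned char   modifiers;  /* shift/ctrl/alt bit mask    */
    };

    int
    main(void)
    {
            struct ukbd_event ev = { 0x0410, 1, 0 };  /* CYRILLIC CAPITAL A */

            printf("U+%04X %s\n", ev.ucs, ev.down ? "down" : "up");
            return 0;
    }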

This would be the ideal solution, but it will be a cold day in hell
before we see it happen.  8-).

This would also spare us the XPG/4-style "all character sets into the
same funnel" input model, which is otherwise objectionable.

> > > If you're going to change your locale naming convention then you need 
> > > to document the change where people can find it and preserve the old 
> > > names (perhaps with symlinks) long enough that people can find either 
> > > the changes or the documentation and make the changes necessary in
> > > their software to accommodate your changes.
> > 
> > I don't think anyone has suggested directly modifying locale specification
> > to anything other than ISO standards.  
> 
> No, but Andrey has said that he is going to/has already given new names
> to FreeBSD locales. I consider it a serious mistake to not maintain
> backwards compatibility with previous releases of FreeBSD. Even in going
> to HPUX-10, HP has maintained the HPUX-9 locale names. In HP's case the
> deprecated names will ultimately be deleted in an as yet unnamed release.
> Given how trivial it is to do this I fail to understand his blatant
> disregard for backwards compatibility from one release to the next.

Arguably, this could be our "as yet unnamed release".  How long would
you suggest keeping the window open?  It has been open for one release
so far.  It's premature to discuss this in the context of X, though:
as you pointed out, there is no point in discussing X until a release
is cut and the code isn't going to randomly change out from under you.

The question is really one of how quickly an automated process can be
put in place that hides the actual values from the user.  Once that
has been done, Andrey can deprecate to his heart's content and not
affect anything but software complexity (reducing it).

Locale is almost entirely data-driven, so it is uniquely immune (well,
relatively immune anyway) to legacy code.  You only have legacy
problems if there is hard-coding in binaries, and that's against the
usage rules in the first place.  Punishing rule-breakers is less of
an ethical problem than punishing your average user.

> > The X locale alias mechanism is
> > indeed an artifact of local extensions (ie: AIX "DOSANSI", HP, etc.)
> > rather than an artifact of the deficiencies in the well-defined naming
> > conventions for locales which are not vendor private.
> 
> An artifact of local extensions? I wouldn't say that. I would say it's
> an implementation detail to overcome the lack of consistency in naming
> locales, e.g.: HP's american.iso88591, Digital's en_US.ISO8859-1, SVR4's 
> en_US, SunOS's iso_8859_1 LC_CTYPE, and all the other variations the 
> vendors use for their ISO locale names. The X Consortium release of R6 
> makes no attempt to cover vendor proprietary locales like HP's roman8 
> locales, or AIX and Unixware Codepage 850 locales.

Well, the locale comes from data and is used to map other data.  The
problem is in the systems attempting to support non-data driven first
cause lookups.  In the end, the translation from the implementation
specific locale names should be *to* the official names.  If X has the
official names and FreeBSD has the official names, then there isn't
a problem dropping the support in FreeBSD.
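
A minimal sketch of that translation step; the vendor names on the
left are the ones you list above, and the "official" form on the right
is only illustrative:

    #include <stdio.h>
    #include <string.h>

    /*
     * Alias table in the spirit of X's locale.alias: whatever name the
     * vendor hands us on the left, the official name on the right.
     * The pairs are illustrative only.
     */
    static const char *alias[][2] = {
            { "american.iso88591",  "en_US.ISO_8859-1" },
            { "en_US.ISO8859-1",    "en_US.ISO_8859-1" },
            { "iso_8859_1",         "en_US.ISO_8859-1" },
            { NULL,                 NULL }
    };

    static const char *
    official_name(const char *name)
    {
            int i;

            for (i = 0; alias[i][0] != NULL; i++)
                    if (strcmp(alias[i][0], name) == 0)
                            return alias[i][1];
            return name;            /* already official, or unknown */
    }

    int
    main(void)
    {
            printf("%s\n", official_name("american.iso88591"));
            return 0;
    }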

What Andrey *might* be overlooking is when the X server is FreeBSD
and the client is one of the legacy systems.

> As an aside I would say that I believe all these companies take their
> standards compliance very seriously. Yet none of them have a problem with 
> not following RFC 7000 in choosing names for their locales. The switch 
> from foo.ISO8859-1 to foo.ISO_8859-1 seems completely gratuitous. The fact 
> that he will compound it by failing to have any sort of backwards 
> compatibility is inexcusable. 

The backward compatibility won't be an issue for the pure BSD environment;
if X throws around only official names internally for things like font
selection, then he should be safe dropping non-RFC 7000 locales entirely.

> Andrey should think about the consequences of upsetting thousands of 
> previously happy FreeBSD users when they discover that the X that they've
> been using just fine for a year or more on FreeBSD 2.0/2.0.5 no longer 
> works, with problems ranging from xterm dumping core to compose processing 
> no longer working.

Using "on" FreeBSD shouldn't be a problem.  Data is data, as long as
it matches.  Using "with" FreeBSD could definitely blow him out of
the water if X isn't RFC 7000 internally.

> > On the other hand, I have no problem whatsoever orphaning vendor-private
> > locale naming mechanisms if it buys an additional level of functionality
> > at no other cost.
> 
> This is not a case of X orphaning vendor locale names. It is a case of
> mapping as many vendor locale names as possible to the corresponding X 
> locale name. It is an X internal implementation detail. It is not, as Andrey 
> claims, a bug that the X Consortium release of R6 does not support the 
> locale names used in an as yet unreleased version of FreeBSD.

I'd argue that there is a difference between "compliant names" and
"correct names".  Andrey can have one or the other, but not both.  The
orphaning of vendor private locales may in fact orphan FreeBSD-private
locales (ie: KOI8-R).

There is a strong emphasis here on X needing to use RFC 7000 for internal
names after doing the alias lookup of a potentially vendor-private locale.


I agree that the aliases file itself is an implementation detail, and may
be a non-issue entirely, depending on the representation used between the
client and the server.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


