Date:      Mon, 16 Oct 1995 19:39:54 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        ache@astral.msk.su (Андрей Чернов)
Cc:        terry@lambert.org, hackers@freefall.freebsd.org, joerg_wunsch@uriah.heep.sax.de, kaleb@x.org
Subject:   Re: A couple problems in FreeBSD 2.1.0-950922-SNAP
Message-ID:  <199510170239.TAA26264@phaeton.artisoft.com>
In-Reply-To: <ZlWzmWmyv3@ache.dialup.demos.ru> from "Андрей Чернов" at Oct 17, 95 05:05:20 am

> >I can *potentially* see ispunct() (though I can't think of any
> >concrete examples off my head; maybe in -9?), and the collating
> >sequence is a problem.
> 
> Not only ispunct(). If you dig deeper, just put all the 8859-*
> charsets in front of you and see how they really differ.

All my good standards documents are at home.  8-(.

> >But this is a problem regardless.  If the code isn't internationalized,
> >it isn't internationalized, and anything you do to pretend it is without
> >actually fixing the code is a kludge.
> 
> An ASCII default table is the way to avoid pitfalls because it restricts
> all operations to 7bit (disallows all 8bit stuff). It is one of the
> reasons why I vote for ASCII.

It doesn't avoid the pitfalls.  A "sort" that isn't localized will fall
over silently when fed non-ASCII data, just as it will when fed data in
the wrong locale.
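
A minimal sketch of the difference, assuming a sort built on qsort().
The names cmp_bytes, cmp_locale and sort_words are made up for
illustration; only strcmp(), strcoll() and qsort() are standard.  The
localized comparator only does something useful if the caller has already
called setlocale(LC_COLLATE, ""); in the default "C" locale strcoll()
is just a byte comparison.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator: raw byte order, ignores the locale entirely. */
static int cmp_bytes(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* qsort comparator: uses the LC_COLLATE rules of the current locale. */
static int cmp_locale(const void *a, const void *b)
{
    return strcoll(*(const char *const *)a, *(const char *const *)b);
}

/*
 * Hypothetical helper: sort an array of strings either by byte value
 * (localized == 0) or by the current locale's collation (localized != 0).
 */
static void sort_words(const char **words, size_t n, int localized)
{
    qsort(words, n, sizeof words[0], localized ? cmp_locale : cmp_bytes);
}
```

With cmp_bytes, every string starting with a byte above 0x7F sorts after
all of the ASCII ones, whatever language it is in, and nothing reports an
error.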


> >The correct thing to do is to call setlocale() in the source.  You could,
> >if you wanted a "quick fix", use setlocale(,""), per your crt0.o hack.
> 
> Calling setlocale() is impossible for 8bit clean programs if they
> are not aware of multi-byte characters. I use a special version of
> setlocale (startup_setlocale) in my crt0 hack which is restricted to <=8bit
> char sizes only.

BS.  Read XPG/3, which doesn't know about multibyte characters, yet
defines setlocale().

It's true that calling an XPG/4 setlocale() in an 8-bit clean, multibyte
unaware program in a runic encoded/multibyte locale will fail.  The
solution is to fix the program.
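
As a rough sketch of what a multibyte-unaware program could do in its own
source until it is fixed properly: adopt the user's LC_CTYPE, then back
out if the locale turns out to need more than one byte per character.
The function name adopt_single_byte_ctype is hypothetical; setlocale()
and MB_CUR_MAX are the standard C interfaces.

```c
#include <locale.h>
#include <stdlib.h>

/*
 * Adopt the environment's LC_CTYPE only if it is a single-byte charset.
 * Returns 1 if the user's locale is now in effect, 0 otherwise.
 */
static int adopt_single_byte_ctype(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0;                   /* request refused; locale unchanged */
    if (MB_CUR_MAX > 1) {
        /* Multibyte locale: this program can't cope, fall back to "C". */
        setlocale(LC_CTYPE, "C");
        return 0;
    }
    return 1;
}
```

This is a stopgap under the stated assumption that the program really is
one-byte-per-character; it does not make the program work in a multibyte
locale.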

> >If you care about collation sequence, then you'll internationalize your
> >code.
> 
> Well, but how about strftime? Isn't setlocale supposed to be called
> before it, or what? I saw several places in our sources where strftime is
> called without any setlocale; do all of them need to be fixed?

Yes.  If they use locale data which cannot be defaulted per XPG3/XPG4,
the code needs to be fixed.
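
A minimal sketch of the fix for a strftime() caller, assuming the program
is happy to take LC_TIME from the environment.  The helper name
format_weekday and its locale_name parameter are made up for illustration;
passing "" selects the user's locale, passing "C" the default one.

```c
#include <locale.h>
#include <stddef.h>
#include <time.h>

/*
 * Format the weekday name of *tm into buf under the named LC_TIME locale.
 * Returns the number of bytes written (0 if buf is too small).
 */
static size_t format_weekday(char *buf, size_t len, const struct tm *tm,
                             const char *locale_name)
{
    /* If the locale is unknown, setlocale() fails and nothing changes. */
    (void)setlocale(LC_TIME, locale_name);
    return strftime(buf, len, "%A", tm);
}
```

In a real program the setlocale() call would normally be made once at
startup rather than before every strftime().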

> >Then use 8859-5 character encoding.  The only deficiency re: KOI8 is
> >that it doesn't match existing data you already have on disk.
> 
> 8859-5 is not an option in any case. It is not my decision but that of
> the whole Russian user community (SUUG - Soviet Unix Users Group).

Then have them replace it by sending a representative to the national
standards body that sends their representative to ISO.  But follow the
8859-x formulation rules this time.

I think the major problem you have with 8859-5 is installed code base,
and that's not an excuse (or American generated code would not know
about locale at all: our installed base is 7 bit ASCII, thank you).

> >> It means that
> >> 1) all is*() macros must be correct for the Russian charset (LC_CTYPE).
> 
> >This will work for 8859-5.  Characters that are completely bogus will
> >fail, but they'd fail anyway.
> 
> Characters that are completely bogus in 8859-1 are valid letters
> in 8859-5. :-)

Guess you better call setlocale() in your code, then.  8-).
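
A small sketch of why the call matters for the is*() macros: the same
byte is classified differently depending on the LC_CTYPE in effect.  The
helper count_letters is hypothetical; isalpha() is standard, and the cast
to unsigned char is needed so that bytes above 0x7F are not passed as
negative values.

```c
#include <ctype.h>
#include <stddef.h>

/* Count the bytes in s that the current LC_CTYPE locale calls letters. */
static size_t count_letters(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* Cast first: a plain char may be signed on this platform. */
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}
```

Without a setlocale(LC_CTYPE, "") beforehand the program runs in the "C"
locale and the result for 8-bit bytes reflects that locale, not the
user's charset.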

> >> 2) strftime must return national data (LC_TIME).
> 
> >Explicitly call setlocale().
> 
> See above.

Not applicable or XPG/3 would have been impossible to implement.  Fix
the code.

> >KOI8 is a peculiar locale in that it doesn't follow the 8859-x rules
> >like it should.  Like EBCDIC, it needs to die in the long term.  On
> 
> And WHY SHOULD IT DO anything? It is an EXISTING CODE TABLE and LOCALES
> must be adapted to it and not vice versa. I promise you that
> it will not die in the next 20-40 years; its population grows
> with each new Internet user.

That's too bad.  I guess you will have to properly localize in order
to use software, then, instead of taking advantage of 8859-x formulation
rules to get 90% solutions.  Like you'd be able to do if the standard
you chose to use met those guidelines.

Or you'll have to run local hacks, like the one currently in crt0.o,
which is unobtrusive now, but which you wanted to make very obtrusive,
when most of us were under the impression that it was a temporary hack
that was going to go away.

> >This whole issue is very similar to the problems that were involved in
> >going to an unmapped page 0, causing NULL dereferences to SIGSEGV.  In
> >the short term, you lost functionality because you couldn't run some
> >programs you used to be able to run.
> 
> It isn't a correct example for this case.

Why not?  You are arguing that your hack is OK, though it violates the
rule of least astonishment by killing, among other things, xterm.

Similarly, the unmapping of page 0 violates the rule of least astonishment
by making programs that used to work fail.

Is one standards-adherence forced failure superior to another?

> >In the locale case, you lose the ability to run 8 bit clean code as if
> >it had been properly internationalized, while making other code plain
> >miserable to use.
> >
> >Without the implied setlocale() call in crt0.o, there is an immediate
> >benefit of ~1.1M of disk in static binaries (from Kaleb's numbers), and
> 
> It is less than this value; if you want, I'll tell you exactly.

I think the important question is "is it non-zero?".

> >the code that isn't internationalized becomes readily apparent.  Just
> >as the code that dereferenced NULL became readily apparent when page 0
> >was unmapped.
> 
> *WHO* will internationalize such code?

People who get pissed off when it fails.  People they piss off by
complaining to them.  The authors, who get pissed off when people
complain to them.

The entire point of doing things this way was an attempt to increase
the annoyance factor for American software companies who wanted to
get into international markets and force them to do the right thing
by getting pissed off OEMs, VARs, VADs, and users on their backs.


> >Setting an "undefined" equality with 8859-1 preserves 8 bit clean
> >operability in the majority of cases, and in the others, the only
> >way that they could have been able to get the functionality was to
> >have partially internationalized their code (you can't get at the
> >altered collation sequence without some knowledge of internationalization
> >implicit in the code).
> >
> >The net effect is that more code gets internationalized correctly, which
> >is in everyone's best interests and increases the code portability instead
> >of tying the users to FreeBSD.
> 
> Well, as I already said, the only thing that made me stand against
> propagating was the non-matching is*() stuff. From your words
> I assume that you simply do nothing about it and mark
> incompatibilities as 'improper i18n in any case'; well,
> that is one way :-)

Yes.  Remember that most of the issues of sort order, etc., aren't fixed
for anyone but the US and Great Britain anyway.  It's *still* broken
with an 8859-1 C locale.

> BTW,
> 
> I intend to keep my hack as it is, because too many
> Russian users already rely on it. I am considering possibilities to
> reduce bloat in the ways that Bruce pointed out, i.e. libc ctype cleanup
> and two different startup_locale stubs, for real ctype and for fake.

Probably it would do well to #ifdef your hack and default it to off;
then the annoyance choice is whether to fix the program or rebuild
the system, with the hope being that the program gets fixed.
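
Something along these lines, purely as a sketch: STARTUP_LOCALE_HACK and
startup_locale are made-up names, not the actual symbols in crt0.  With
the macro undefined (the default) the startup code only queries the
locale and changes nothing.

```c
#include <locale.h>

/*
 * Hypothetical startup hook.  Build with -DSTARTUP_LOCALE_HACK to get the
 * implied setlocale(); leave it undefined (the default) to stay in "C".
 */
static const char *startup_locale(void)
{
#ifdef STARTUP_LOCALE_HACK
    /* Hack enabled: adopt the environment's locale on the program's behalf. */
    return setlocale(LC_ALL, "");
#else
    /* Hack disabled: a NULL argument only reports the current locale. */
    return setlocale(LC_ALL, NULL);
#endif
}
```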

It's possible that you could set up a distribution or alternate binaries
with the hack in place.  If so, they should be annoying to obtain.  8-).

Having to answer questions with "get the other binaries or rebuild the
system" is the annoyance factor that people in the ASCII or 8859-1
locales need to make them fix their code.  Like me.

Without something like this, we Americans (and, presumably, the British)
will sit on our butts and keep writing bogus code.  We don't care: it
doesn't matter to us if the C locale is 7 bit ASCII or 8859-1, our butt
is covered, the code will work on our boxes.

Makes me want to define a C locale that has nothing but trash in it.  8-).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


