Date:      Tue, 20 Jan 1998 19:15:41 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        jfieber@indiana.edu
Cc:        akm@mother.sneaker.net.au, louie@TransSys.COM, daniel_sobral@voga.com.br, tlambert@primenet.com, hackers@FreeBSD.ORG
Subject:   Re: Wide characters on tcp connections
Message-ID:  <199801201915.MAA26214@usr04.primenet.com>
In-Reply-To: <Pine.BSF.3.96.980120101241.26398Z-100000@fallout.campusview.indiana.edu> from "John Fieber" at Jan 20, 98 10:40:47 am

> > | If you're looking for a standard way to move multibyte characters, then
> > | choose any one of a number of encodings already used to store multibyte
> > | characters in files.
> > 
> > Moving them's not quite the same as storing them... byte order usually
> > comes into play a lot more when you've got to shunt the data across a network.
> > 
> > I think Unicode defines that it is to be stored in network byte order.
> 
> Maybe this will clarify things a bit.  From _The Unicode Standard
> 2.0_, Section 3.1 Conformance Requirements: 
> 
> C1. A process shall interpret Unicode code values as 16-bit
>     quantities. 
> 
> C2. The Unicode Standard does not specify any order of bytes
>     inside a Unicode value.
>     
> C3. A process shall interpret a Unicode value that has been
>     serialized into a sequence of bytes, by most significant byte
>     first, in the absence of higher level protocols.
> 
> If you think of writing to a file as serializing, then C3
> applies.  If you think of it as dumping memory, then C2 applies. 
> I believe NT generally takes the C2 route.  Terry, can you
> confirm this?  How about for IPC? 

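For wide character strings for IPC, the character strings are sent
in the sender's native byte order along with a byte order indicator,
and the receiver swaps if necessary.  This is consistent with DCE
RPC's NDR ("receiver makes right"; XDR, by contrast, is always
big-endian), and with the Microsoft bias toward Intel-centric
representation mechanisms.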
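
A minimal sketch of the "native order plus byte order indicator" idea
follows.  It is not NT's actual wire format: using U+FEFF as the
indicator is an assumption borrowed from the Unicode byte order mark
convention, and the names wide_pack() and wide_unpack() are
hypothetical.  The sender never swaps; the receiver swaps only when
the indicator arrives reversed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BYTE_ORDER_MARK 0xFEFFu	/* reads back as 0xFFFE if byte-swapped */

static uint16_t
swap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/*
 * Sender: copy the indicator and n 16-bit code units into buf in
 * native byte order.  Returns bytes written, or 0 if buf is too small.
 */
size_t
wide_pack(const uint16_t *src, size_t n, unsigned char *buf, size_t buflen)
{
	uint16_t mark = BYTE_ORDER_MARK;
	size_t need = (n + 1) * sizeof(uint16_t);

	if (buflen < need)
		return 0;
	memcpy(buf, &mark, sizeof mark);
	memcpy(buf + sizeof mark, src, n * sizeof(uint16_t));
	return need;
}

/*
 * Receiver: inspect the indicator, then copy the code units into dst,
 * swapping each one only if the sender's byte order differs from ours.
 * Returns the number of code units stored, or -1 on a bad message.
 */
long
wide_unpack(const unsigned char *buf, size_t len, uint16_t *dst, size_t dstlen)
{
	uint16_t mark;
	size_t n, i;
	int swap;

	if (len < sizeof mark || len % sizeof(uint16_t) != 0)
		return -1;
	memcpy(&mark, buf, sizeof mark);
	if (mark == BYTE_ORDER_MARK)
		swap = 0;		/* same byte order as the sender */
	else if (mark == swap16(BYTE_ORDER_MARK))
		swap = 1;		/* opposite byte order */
	else
		return -1;		/* no recognizable indicator */

	n = len / sizeof(uint16_t) - 1;
	if (n > dstlen)
		return -1;
	memcpy(dst, buf + sizeof mark, n * sizeof(uint16_t));
	if (swap)
		for (i = 0; i < n; i++)
			dst[i] = swap16(dst[i]);
	return (long)n;
}
```

The attraction of this scheme is that two like-endian machines never
pay for a swap; the cost is that every receiver has to carry the
swapping code.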

I believe the file I/O interfaces also expect Intel byte order in the
files, so that existing files do not have to be rewritten for NT on
platforms whose native order is network byte order rather than Intel
byte order.

> Just as a footnote, UTF-8 is a big win for English text because
> it generally ends up 1 character == 1 byte, but is a big loss for
> CJK (among others) where 1 character == 3 bytes.  UTF-8 is no
> silver bullet for endian debates.

Any multibyte encoding is a loss for:

o	Fixed field storage
o	Forms input
o	Length-limited buffer technologies (like those in most
	modern computer languages in use today).
o	String length calculation

Etc.

8-(.
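
To make the string length and length-limited buffer items concrete,
here is a small sketch using UTF-8 as the example multibyte encoding.
The function names utf8_char_count() and utf8_fit() are hypothetical,
and both assume well-formed input and do no validation.

```c
#include <stddef.h>
#include <string.h>

/*
 * Count characters (code points) in a NUL-terminated UTF-8 string.
 * Continuation bytes have the bit pattern 10xxxxxx and are skipped,
 * so only the first byte of each character is counted.
 */
size_t
utf8_char_count(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

/*
 * Return the largest byte length <= max that does not cut a
 * character in half, i.e. how much of s fits in a max-byte field.
 */
size_t
utf8_fit(const char *s, size_t max)
{
	size_t len = strlen(s);

	if (len <= max)
		return len;
	/* Back up while the first excluded byte is a continuation byte. */
	while (max > 0 && ((unsigned char)s[max] & 0xC0) == 0x80)
		max--;
	return max;
}
```

For a string of two CJK characters that take three bytes each in
UTF-8, strlen() reports 6 while utf8_char_count() reports 2, and a
4-byte field can hold only the first character: utf8_fit() returns 3.
With a fixed-width encoding, neither function would be needed.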


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


