Date: Tue, 20 Jan 1998 19:15:41 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: jfieber@indiana.edu
Cc: akm@mother.sneaker.net.au, louie@TransSys.COM, daniel_sobral@voga.com.br, tlambert@primenet.com, hackers@FreeBSD.ORG
Subject: Re: Wide characters on tcp connections
Message-ID: <199801201915.MAA26214@usr04.primenet.com>
In-Reply-To: <Pine.BSF.3.96.980120101241.26398Z-100000@fallout.campusview.indiana.edu> from "John Fieber" at Jan 20, 98 10:40:47 am

> > | If you're looking for a standard way to move multibyte characters, then
> > | choose any one of a number of encodings already used to store multibyte
> > | characters in files.
> >
> > Moving them's not quite the same as storing them.... byte orders, usually
> > come into play a lot more when you've got to shunt the data across a network.
> >
> > I think Unicode defines that it is to be stored in network byte order.
>
> Maybe this will clarify things a bit.  From _The Unicode Standard
> 2.0_, Section 3.1 Conformance Requirements:
>
>   C1. A process shall interpret Unicode code values as 16-bit
>       quantities.
>
>   C2. The Unicode Standard does not specify any order of bytes
>       inside a Unicode value.
>
>   C3. A process shall interpret a Unicode value that has been
>       serialized into a sequence of bytes, by most significant byte
>       first, in the absence of higher level protocols.
>
> If you think of writing to a file as serializing, then C3
> applies.  If you think of it as dumping memory, then C2 applies.
> I believe NT generally takes the C2 route.  Terry, can you
> confirm this?  How about for IPC?

For wide character strings for IPC, the character strings are sent
in native byte order with a byte order indicator.  This is consistent
with DCE RPC's XDR, and with the Microsoft bias toward Intel-centric
representation mechanisms.

I believe the File I/O interfaces also expect Intel byte order in the
files, so that they do not have to rewrite their files for NT on
platforms with network byte order, as opposed to Intel byte order.

> Just as a footnote, UTF-8 is a big win for English text because
> it generally ends up 1 character == 1 byte, but is a big loss for
> CJK (among others) where 1 character == 3 bytes.  UTF-8 is no
> silver bullet for endian debates.

Any multibyte encoding is a loss for:

o	Fixed field storage
o	Forms input
o	Length-limited buffer technologies (like those in most modern
	computer languages in use today).
o	String length calculation

Etc.  8-(.

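
To make the two serialization styles and the length problem concrete, here is a minimal Python sketch.  It is an illustration only: the sample string, the decode_wire() helper, and the use of U+FEFF as the "byte order indicator" are assumptions made for this example, not a description of what NT or DCE RPC actually put on the wire.

```python
import codecs

# Hypothetical sample: ASCII "A" followed by the CJK ideograph U+4E2D.
text = "A\u4e2d"

# C3 style: no higher level protocol, so most significant byte first.
big_endian = text.encode("utf-16-be")            # bytes 00 41 4e 2d

# "Native order plus indicator" style, as seen from a little-endian sender.
# Assumption: the indicator is a leading U+FEFF byte order mark.
little_with_mark = codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def decode_wire(data: bytes) -> str:
    """Decode 16-bit code values, honoring a leading byte order mark."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # No indicator present: fall back to the C3 default (big-endian).
    return data.decode("utf-16-be")


# The multibyte cost: character count and byte count diverge in UTF-8.
utf8 = text.encode("utf-8")                      # 1 byte for "A", 3 for U+4E2D
chars, octets = len(text), len(utf8)
```

Both wire forms decode back to the same two characters, but the UTF-8 form occupies four bytes, and cutting it at a fixed three-byte field boundary leaves an incomplete sequence that no longer decodes.  That is the fixed-field and length-limited-buffer problem in the list above.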
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.