Date:      Tue, 20 Jan 1998 21:18:36 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        Pierre.Beyssac@hsc.fr (Pierre Beyssac)
Cc:        louie@TransSys.COM, tlambert@primenet.com, daniel_sobral@voga.com.br, hackers@FreeBSD.ORG
Subject:   Re: Wide characters on tcp connections
Message-ID:  <199801202118.OAA27310@usr06.primenet.com>
In-Reply-To: <19980120120216.OB37901@mars.hsc.fr> from "Pierre Beyssac" at Jan 20, 98 12:02:16 pm

> I can add that, if I've understood UTF-8 right, it's fairly easy to
> resynchronize in case you happen to lose sync. It just takes one or
> two lost or garbled chars. I think that UTF-8 is one of the ways to
> go. Its only drawback is that it's not compatible with "pure" 8 bits
> ISO-Latin-1 streams as it reuses 0x80-0xff.

It will take up to 2 bytes to resync, since it can take up to 3
bytes to represent a single 16 bit value.
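
A minimal sketch of the resync being discussed, assuming the receiver
simply discards UTF-8 continuation bytes (those of the form 10xxxxxx)
until it sees a byte that can start a character. The helper names
`is_continuation` and `resync` are hypothetical, chosen only for this
illustration.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which all look like 0b10xxxxxx."""
    return (byte & 0xC0) == 0x80


def resync(buf: bytes, pos: int = 0) -> int:
    """Return the index of the first byte at or after pos that can start a character."""
    while pos < len(buf) and is_continuation(buf[pos]):
        pos += 1
    return pos


# U+20AC (EURO SIGN) is a 16 bit value that needs three bytes: E2 82 AC.
data = "\u20acA".encode("utf-8")
damaged = data[1:]                # assume the lead byte 0xE2 was lost in transit
skipped = resync(damaged)         # bytes discarded before a character boundary
recovered = damaged[skipped:].decode("utf-8")
```

In this case the receiver throws away the two orphaned continuation
bytes and picks up again at the "A".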

This assumes you are willing to push an arbitrary number of bytes
to get a 16 bit value to the other end of the pipe, and that you are
willing to take the computational overhead of the conversion, and
that you are willing to treat your values as a stream instead of
an external data representation of a structure (i.e., you are willing
to give up being able to tell the other end to expect a certain number
of bytes in a transaction).
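
To illustrate the transaction-size point, here is a sketch under
assumed framing: a 16 bit count followed by one 16 bit value per
character, in network byte order. The layout and the names
`pack_fixed` and `fixed_size` are hypothetical and do not come from
any real protocol. With this framing the byte count follows from the
character count alone; with UTF-8 it depends on the content.

```python
import struct


def pack_fixed(text: str) -> bytes:
    """Hypothetical framing: 16 bit count, then one 16 bit value per character."""
    values = [ord(c) for c in text]
    if any(v > 0xFFFF for v in values):
        raise ValueError("only 16 bit values fit this framing")
    return struct.pack(f"!H{len(values)}H", len(values), *values)


def fixed_size(count: int) -> int:
    """Bytes on the wire for count characters: known before any data is read."""
    return 2 + 2 * count


ascii_text = "abc"
mixed_text = "a\u20acc"           # same character count, one non-ASCII value

fixed_lengths = (len(pack_fixed(ascii_text)), len(pack_fixed(mixed_text)))
utf8_lengths = (len(ascii_text.encode("utf-8")), len(mixed_text.encode("utf-8")))
```

Both fixed-width messages are the same size, while the two UTF-8
encodings differ even though each holds three characters.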

UTF encoding is evil personified if you are doing database work.  You
never know how many "real" characters (16 bit values) can be stored in
any N bytes of a fixed field.  This makes input complicated, since
you must veto input based on whether the UTF encoding fills up the field,
makes it impossible to fully specify field length in a schema, and in
general, makes life Hell.
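
A sketch of the fixed-field problem, assuming UTF-8 and 16 bit
character values. `FIELD_BYTES`, `fits` and `capacity_range` are
hypothetical names for this illustration, not part of any database
API.

```python
FIELD_BYTES = 8  # hypothetical fixed field width taken from a schema


def fits(text: str, field_bytes: int = FIELD_BYTES) -> bool:
    """Input veto: accept text only if its UTF-8 encoding fits the field."""
    return len(text.encode("utf-8")) <= field_bytes


def capacity_range(field_bytes: int) -> tuple:
    """(guaranteed, best case) count of 16 bit characters in field_bytes of UTF-8.

    A 16 bit value takes between 1 and 3 bytes, so the schema can only
    promise field_bytes // 3 characters, not a single fixed number.
    """
    return field_bytes // 3, field_bytes


accepted_ascii = fits("abcdefgh")             # 8 characters, 8 bytes
accepted_two = fits("\u20ac\u20ac")           # 2 characters, 6 bytes
accepted_three = fits("\u20ac\u20ac\u20ac")   # 3 characters, 9 bytes
```

The same 8 byte field holds eight ASCII characters but rejects three
euro signs, so the acceptance test has to run on the encoded bytes
rather than on the character count.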

The people who like UTF encoding are the people who've already had
their mail forwarded to Hell, mostly through already losing these
programmatically useful abilities to some other evil, like EUC
encoding and ISO2022.

FWIW, CIFS (aka SMB) ships long (Unicode) names over the wire in
wchar_t's in x86 byte order.
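
As an illustration of the byte order only (this is not the actual
SMB packet layout, and `wire_name` is a hypothetical helper), 16 bit
values in x86 order are sent low byte first:

```python
import struct


def wire_name(name: str) -> bytes:
    """Pack each character as a 16 bit value, low byte first (x86 order)."""
    values = [ord(c) for c in name]
    if any(v > 0xFFFF for v in values):
        raise ValueError("only 16 bit values are handled in this sketch")
    return struct.pack(f"<{len(values)}H", *values)


encoded = wire_name("A\u20ac")    # 'A' is 0x0041, EURO SIGN is 0x20AC
```

Every name of N characters takes exactly 2 * N bytes this way.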


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


