Date:      Sat, 21 Apr 2012 22:07:03 +0200
From:      Polytropon <freebsd@edvax.de>
To:        Lars Eighner <lars@larseighner.com>
Cc:        freebsd-questions@freebsd.org
Subject:   Re: converting UTF-8 to HTML
Message-ID:  <20120421220703.86683bc9.freebsd@edvax.de>
In-Reply-To: <alpine.BSF.2.00.1204210909450.5338@abbf.6qbyyneqvnyhc.pbz>
References:  <20120421055823.GA6788@tinyCurrent> <4F9253D7.7010609@locolomo.org> <4F9278A2.1020301@locolomo.org> <alpine.BSF.2.00.1204210909450.5338@abbf.6qbyyneqvnyhc.pbz>

On Sat, 21 Apr 2012 09:10:03 -0500 (CDT), Lars Eighner wrote:
> On Sat, 21 Apr 2012, Erik Nørgaard wrote:
>
> > When characters show up wrong in the users browser it's usually because the
> > browser is set to use a non-UTF-8 charset by default such as windows-1252,
> > the web server sends the charset=ascii in the http header and there is no or
> > incorrect meta tag to resolve the problem. Non UTF-8 charsets are a leftover
> > from last millenia that we sometimes still choke on .. sorry the rant ;)
>
> UTF-8 is a waste of storage for most people [...]

Disks and RAM are huge and cheap. Plenty of space that is
going to be used. Nobody cares.



> [...] and is incompatible with
> text-mode tools: it's simply another bid to make it impossible to run
> without a GUI.

Again, nobody cares - until, of course, it's too late and you
need to do some recovery or analytic tasks in a limited
environment or via a connection with limited means.

Regarding the fun of encodings, endianness, representation,
use ("fi" the two letters vs. "ﬁ" the ligature, or "ß"
the 1-byte sequence vs. "ß" the two-byte sequence), see
the following document:

Matt Mayer: Love Hotels and Unicode
http://www.reigndesign.com/blog/love-hotels-and-unicode/
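
To make the two examples above concrete, here is a minimal
Python sketch. It assumes that "1-byte sequence" refers to
ISO-8859-1 and "two-byte sequence" to UTF-8; the variable
names are only illustrative, and NFKC normalization is just
one possible way to fold the ligature back into two letters.

```python
import unicodedata

# U+00DF LATIN SMALL LETTER SHARP S
sharp_s = "\u00df"

# Same character, different byte sequences depending on the encoding.
latin1_bytes = sharp_s.encode("iso-8859-1")  # one byte
utf8_bytes = sharp_s.encode("utf-8")         # two bytes

# Two separate letters vs. the single ligature code point U+FB01.
two_letters = "fi"
ligature = "\ufb01"

# Compatibility normalization (NFKC) folds the ligature into "f" + "i".
folded = unicodedata.normalize("NFKC", ligature)

print(len(latin1_bytes), len(utf8_bytes))
print(two_letters == ligature, two_letters == folded)
```

A plain byte-wise or code-point-wise comparison treats the
ligature and the two letters as different strings, which is
exactly where tools start to disagree.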

And finally it offers an interesting attack vector, given
the fact that several Unicode characters "look" the same,
but are in fact different code points. So "two files with
the 'same' name" is a possible means that malware
implementers can utilize to mislead users.

Short example from MICROS~1 land here:
http://blogs.technet.com/b/mmpc/archive/2011/08/10/can-we-believe-our-eyes.aspx
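
A rough Python sketch of the "same name" problem follows.
The file names are made up for illustration (they are not
taken from the linked article), and the helper
describe_non_ascii() is hypothetical; it merely lists the
non-ASCII code points so the difference becomes visible.

```python
import unicodedata

# Hypothetical file names: the second one uses U+0430
# (CYRILLIC SMALL LETTER A) where the first uses ASCII "a".
latin_name = "paypal.txt"
mixed_name = "p\u0430yp\u0430l.txt"

def describe_non_ascii(name):
    """Return (index, Unicode name) for every non-ASCII character."""
    return [
        (i, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(name)
        if ord(ch) > 127
    ]

# In many fonts both names render identically, yet they differ.
print(latin_name == mixed_name)
print(describe_non_ascii(latin_name))
print(describe_non_ascii(mixed_name))
```

A check like this only flags non-ASCII characters; it does
not decide whether a name is malicious, since plenty of
legitimate names contain them.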

But none of this negates the usefulness of Unicode / UTF-8
in general. Especially in collaborative settings with
multi-language document processing requirements, it is a
helpful thing: working with "normal" (ASCII) letters,
Cyrillic ones, Chinese and Japanese symbols, and Arabic
writing is no big deal as long as all the tools properly
support it the _same_ way.



-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...


