From owner-freebsd-questions@FreeBSD.ORG Sat Apr 21 20:07:06 2012
Date: Sat, 21 Apr 2012 22:07:03 +0200
From: Polytropon
To: Lars Eighner
Cc: freebsd-questions@freebsd.org
Subject: Re: converting UTF-8 to HTML
Message-Id: <20120421220703.86683bc9.freebsd@edvax.de>
References: <20120421055823.GA6788@tinyCurrent> <4F9253D7.7010609@locolomo.org> <4F9278A2.1020301@locolomo.org>
Organization: EDVAX
X-Mailer: Sylpheed 3.1.1 (GTK+ 2.24.5; i386-portbld-freebsd8.2)

On Sat, 21 Apr 2012 09:10:03 -0500 (CDT), Lars Eighner wrote:
> On Sat, 21 Apr 2012, Erik Nørgaard wrote:
>
> > When characters show up wrong in the users browser it's usually because the
> > browser is set to use a non-UTF-8 charset by default such as windows-1252,
> > the web server sends the charset=ascii in the http header and there is no or
> > incorrect meta tag
> > to resolve the problem. Non UTF-8 charsets are a leftover
> > from last millenia that we sometimes still choke on .. sorry the rant ;)
>
> UTF-8 is a waste of storage for most people [...]

Disks and RAM are huge and cheap. Plenty of space that
is going to be used. Nobody cares.

> [...] and is incompatiple with
> text-mode tools: it's simple another bid to make it impossible to run
> without a GUI.

Again, nobody cares - until, of course, it's too late and you
need to do some recovery or analytic tasks in a limited
environment or via a connection with limited means.

Regarding the fun of encodings, endianness, representation,
use ("fi" the two letters vs. "fi" the ligature, or "ß" the
1-byte sequence vs. "ß" the two-byte sequence), see the
following document:

	Matt Mayer: Love Hotels and Unicode
	http://www.reigndesign.com/blog/love-hotels-and-unicode/

And finally it offers an interesting attack vector, given
the fact that several Unicode characters "look" the same,
but in fact are different. So "two files with the 'same'
name" is a possible means that malware implementers can
utilize to mislead the users. Short example from MICROS~1
land here:

	http://blogs.technet.com/b/mmpc/archive/2011/08/10/can-we-believe-our-eyes.aspx

But this all doesn't negate the usefulness of Unicode / UTF-8
in general. Especially when you have collaborative settings
with multi-language document processing requirements, it is
a helpful thing, as working with "normal" (ASCII) letters,
Cyrillic ones, Chinese and Japanese symbols, and Arabic writing
is no big deal as long as all the tools properly support
it the _same_ way.

-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
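
The points above about byte lengths, ligatures and look-alike characters
can be illustrated with a few lines of Python. This is only a minimal
sketch and not part of the original thread: the file names and the
describe() helper are made up for illustration, and the Cyrillic "а" is
just one possible look-alike.

```python
import unicodedata

# "ß" is one byte in ISO-8859-1 (0xDF) but two bytes in UTF-8 (0xC3 0x9F).
sharp_s = "\u00df"
latin1_bytes = sharp_s.encode("iso-8859-1")
utf8_bytes = sharp_s.encode("utf-8")

# The "fi" ligature (U+FB01) is a single code point; NFKC compatibility
# normalization folds it into the two letters "f" + "i".
ligature = "\ufb01"
folded = unicodedata.normalize("NFKC", ligature)

# Look-alike characters: Latin "a" (U+0061) vs. Cyrillic "а" (U+0430).
# Both hypothetical file names render almost identically, yet differ.
latin_name = "readme.txt"
spoofed_name = "re\u0430dme.txt"
same_name = latin_name == spoofed_name


def describe(text):
    """Return (code point, Unicode name) pairs for the non-ASCII characters."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
            for ch in text if ord(ch) > 127]


print(len(latin1_bytes), len(utf8_bytes))
print(folded, same_name)
print(describe(spoofed_name))
```

Listing the non-ASCII code points of a name, as describe() does, is one
simple way to spot such a substitution on a text-only terminal.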