Message-ID: <49250DFA.2050208@freebsd.org>
Date: Wed, 19 Nov 2008 23:12:58 -0800
From: Tim Kientzle <kientzle@freebsd.org>
User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.12) Gecko/20060422
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Nick Hibma <nick@van-laarhoven.org>
References: <200811190842.59377.nick@van-laarhoven.org>
In-Reply-To: <200811190842.59377.nick@van-laarhoven.org>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Cc: FreeBSD Hackers Mailing List <hackers@freebsd.org>
Subject: Re: Unicode USB strings conversion

Nick Hibma wrote:
> In the USB code (and I bet it is the same in the USB4BSD code) unicode 
> characters in strings are converted in a very crude way to ASCII. As I have 
> a user on the line who sees rubbish in his logs and when using 
> usbctl/usbdevs/etc., I bet this is the problem.
> 
> I'd like to try and fix this problem by using libkern/libiconv.
> 
> 1) Is this the right approach to convert UTF8 to printable string in  the 
> kernel?
> 
> 2) Is this needed at all in the short term future? I remember seeing 
> attempts at making the kernel use UTF8.
> 
> 3) Does anyone know of a good example in the code without me having to hunt 
> through the kernel to find it?
> 
> For reference: The code that needs replacing is:
> 
> usbd_get_string():
> 
>         s = buf;
>         n = size / 2 - 1;
>         for (i = 0; i < n && i < len - 1; i++) {
>                 c = UGETW(us.bString[i]);
>                 /* Convert from Unicode, handle buggy strings. */
>                 if ((c & 0xff00) == 0)
>                         *s++ = c;
>                 else if ((c & 0x00ff) == 0 && swap)
>                         *s++ = c >> 8;
>                 else
>                         *s++ = '?';
>         }
>         *s++ = 0;
> 
> I haven't got the USB specs handy, but I believe that this is a simple way 
> of converting LE and BE UTF8 to ASCII.

First, get your terminology straight.  It looks
like UGETW() is returning 16-bit Unicode code units.
That would be UTF-16, not UTF-8.  UTF-8 is a popular
multibyte encoding which uses 1 to 4 bytes per character.
ASCII values (less than 128) are preserved as single bytes;
anything else is encoded as a multibyte sequence.

There are two problems with UTF-16:  First is determining
the byte order.  Second is that nobody displays UTF-16
directly.  (Well, almost nobody.)

The code above is fine if you're sure you're getting ASCII
(it looks at each character and guesses the byte order)
but is otherwise pretty lame.  You didn't show the code
that set the 'swap' variable.

If you really want legible output, your best option by
far is to convert it to UTF-8 and emit that.  That
still preserves ASCII, but gives a chance of viewing
non-ASCII in a suitable terminal program.  (And there
are even a couple of folks looking into UTF-8 support for
syscons.)

<rolling up sleeves>  The basic UTF-16 to UTF-8
conversion is pretty simple:

      if (c < 0x80) {
        /* 1 byte: 0x00..0x7f (ASCII) is copied unchanged. */
        *s++ = c;
      } else if (c < 0x800) {
        /* 2 bytes: 0x80..0x7ff */
        *s++ = 0xc0 | ((c >> 6) & 0x1f);
        *s++ = 0x80 | (c & 0x3f);
      } else if (c < 0x10000) {
        /* 3 bytes: 0x800..0xffff */
        *s++ = 0xe0 | ((c >> 12) & 0x0f);
        *s++ = 0x80 | ((c >> 6) & 0x3f);
        *s++ = 0x80 | (c & 0x3f);
      } else {
        /* 4 bytes: 0x10000 and up; 'c' must be wider than 16 bits. */
        *s++ = 0xf0 | ((c >> 18) & 0x07);
        *s++ = 0x80 | ((c >> 12) & 0x3f);
        *s++ = 0x80 | ((c >> 6) & 0x3f);
        *s++ = 0x80 | (c & 0x3f);
      }

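This assumes that 'c' is a Unicode character taken from
UTF-16 in native byte order.  A single 16-bit value never
reaches the last branch; that one only applies once a
surrogate pair (0xD800-0xDBFF followed by 0xDC00-0xDFFF)
has been combined into one code point above 0xFFFF.  If you
really don't know the byte order, you'll need to find some
way to guess.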
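
For what it's worth, here is a rough sketch of how the whole
conversion might look as a standalone function.  The name
utf16_to_utf8() and its signature are made up for illustration
(nothing like it is in the USB code), and replacing unpaired
surrogates with U+FFFD is just one reasonable policy, not the
only one.  It assumes the input is already in native byte order.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the USB code): convert n UTF-16
 * code units in native byte order to a NUL-terminated UTF-8 string.
 * Returns the number of bytes written, not counting the NUL.  Output
 * is truncated at a character boundary if dst is too small.
 */
static size_t
utf16_to_utf8(const uint16_t *src, size_t n, char *dst, size_t dstsize)
{
	size_t i, need, out = 0;
	uint32_t c;

	if (dstsize == 0)
		return (0);
	for (i = 0; i < n; i++) {
		c = src[i];
		if (c >= 0xd800 && c <= 0xdbff && i + 1 < n &&
		    src[i + 1] >= 0xdc00 && src[i + 1] <= 0xdfff) {
			/* Surrogate pair: combine into one code point. */
			c = 0x10000 + ((c - 0xd800) << 10) +
			    (src[i + 1] - 0xdc00);
			i++;
		} else if (c >= 0xd800 && c <= 0xdfff) {
			/* Unpaired surrogate: substitute U+FFFD. */
			c = 0xfffd;
		}
		need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
		if (out + need >= dstsize)
			break;		/* keep room for the NUL */
		if (need == 1) {
			dst[out++] = (char)c;
		} else if (need == 2) {
			dst[out++] = (char)(0xc0 | (c >> 6));
			dst[out++] = (char)(0x80 | (c & 0x3f));
		} else if (need == 3) {
			dst[out++] = (char)(0xe0 | (c >> 12));
			dst[out++] = (char)(0x80 | ((c >> 6) & 0x3f));
			dst[out++] = (char)(0x80 | (c & 0x3f));
		} else {
			dst[out++] = (char)(0xf0 | (c >> 18));
			dst[out++] = (char)(0x80 | ((c >> 12) & 0x3f));
			dst[out++] = (char)(0x80 | ((c >> 6) & 0x3f));
			dst[out++] = (char)(0x80 | (c & 0x3f));
		}
	}
	dst[out] = '\0';
	return (out);
}
```

Note that the output buffer needs up to 3 bytes per input code
unit plus one for the NUL, so a buffer sized for the old
one-byte-per-character loop will truncate non-ASCII strings.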

One way to guess is to assume that ASCII characters
are common, in which case, you'll see things with the
high order byte 0.  In some environments, a "Byte-order mark"
is used as the first character.  This is character 0xFEFF.
(The byte-swapped 0xFFFE is illegal, so if you see that,
you know you've got the wrong byte order.)
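
A rough sketch of that guessing, in case it helps.  The function
name utf16_looks_swapped() is hypothetical, and the "count which
byte is zero more often" rule is only a heuristic that works when
the string is mostly ASCII; it will guess wrong on some non-ASCII
strings that carry no byte-order mark.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return 1 if the n code units in src look
 * byte-swapped relative to native order, 0 otherwise.
 */
static int
utf16_looks_swapped(const uint16_t *src, size_t n)
{
	size_t i, native = 0, swapped = 0;

	if (n > 0 && src[0] == 0xfffe)
		return (1);	/* byte-swapped byte-order mark */
	if (n > 0 && src[0] == 0xfeff)
		return (0);	/* byte-order mark in native order */
	for (i = 0; i < n; i++) {
		if (src[i] == 0)
			continue;	/* says nothing either way */
		if ((src[i] & 0xff00) == 0)
			native++;	/* high byte 0: fine as is */
		else if ((src[i] & 0x00ff) == 0)
			swapped++;	/* low byte 0: fine once swapped */
	}
	return (swapped > native);
}
```

If it returns 1, swap the two bytes of each code unit before
running the UTF-8 conversion.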

Good luck!

Tim