From: Tim Kientzle
Date: Wed, 19 Nov 2008 23:12:58 -0800
To: Nick Hibma
Cc: FreeBSD Hackers Mailing List
Subject: Re: Unicode USB strings conversion
Message-ID: <49250DFA.2050208@freebsd.org>
In-Reply-To: <200811190842.59377.nick@van-laarhoven.org>

Nick Hibma wrote:
> In the USB code (and I bet it is the same in the USB4BSD code) unicode
> characters in strings are converted in a very crude way to ASCII. As I have
> a user on the line who sees rubbish in his logs and when using
> usbctl/usbdevs/etc., I bet this is the problem.
>
> I'd like to try and fix this problem by using libkern/libiconv.
>
> 1) Is this the right approach to convert UTF8 to printable string in the
> kernel?
>
> 2) Is this needed at all in the short term future?
> I remember seeing attempts at making the kernel use UTF8.
>
> 3) Does anyone know of a good example in the code without me having to hunt
> through the kernel to find it?
>
> For reference: The code that needs replacing is:
>
> usbd_get_string():
>
>         s = buf;
>         n = size / 2 - 1;
>         for (i = 0; i < n && i < len - 1; i++) {
>                 c = UGETW(us.bString[i]);
>                 /* Convert from Unicode, handle buggy strings. */
>                 if ((c & 0xff00) == 0)
>                         *s++ = c;
>                 else if ((c & 0x00ff) == 0 && swap)
>                         *s++ = c >> 8;
>                 else
>                         *s++ = '?';
>         }
>         *s++ = 0;
>
> I haven't got the USB specs handy, but I believe that this is a simple way
> of converting LE and BE UTF8 to ASCII.

First, get your terminology straight. It looks like UGETW() is
returning 16-bit Unicode code units. That would be UTF-16, not UTF-8.

UTF-8 is a popular multibyte encoding which uses 1 to 4 bytes per
character. ASCII values (less than 128) get preserved; anything else
gets encoded as a multibyte sequence.

There are two problems with UTF-16: First is determining the byte
order. Second is that nobody displays UTF-16 directly. (Well, almost
nobody.)

The code above is fine if you're sure you're getting ASCII (it looks at
each character and guesses the byte order) but is otherwise pretty
lame: any character with both bytes nonzero comes out as '?'. You
didn't show the code that set the 'swap' variable.

If you really want legible output, your best option by far is to
convert it to UTF-8 and emit that. That still preserves ASCII, but
gives a chance of viewing non-ASCII in a suitable terminal program.
(And there are even a couple of folks looking into UTF-8 support for
syscons.)

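The basic UTF-16 to UTF-8 conversion is pretty simple:

        if (c < 0x80) {
                /* ASCII: one byte, unchanged. */
                *s++ = c;
        } else if (c < 0x800) {
                /* Two bytes: 110xxxxx 10xxxxxx */
                *s++ = 0xc0 | ((c >> 6) & 0x1f);
                *s++ = 0x80 | (c & 0x3f);
        } else if (c < 0x10000) {
                /* Three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
                *s++ = 0xe0 | ((c >> 12) & 0x0f);
                *s++ = 0x80 | ((c >> 6) & 0x3f);
                *s++ = 0x80 | (c & 0x3f);
        } else {
                /* Four bytes: only for code points above 0xffff. */
                *s++ = 0xf0 | ((c >> 18) & 0x07);
                *s++ = 0x80 | ((c >> 12) & 0x3f);
                *s++ = 0x80 | ((c >> 6) & 0x3f);
                *s++ = 0x80 | (c & 0x3f);
        }

This assumes that 'c' is a Unicode character in native byte order, held
in a variable wider than 16 bits. A single 16-bit UTF-16 value never
exceeds 0xffff, so the last branch only matters if you first combine
surrogate pairs (a value in 0xd800-0xdbff followed by one in
0xdc00-0xdfff) into one code point. Everything else takes one of the
first three branches, so allow up to three output bytes per 16-bit
input value.

If you really don't know the byte order, you'll need to find some way
to guess. One way to guess is to assume that ASCII characters are
common, in which case you'll see things with the high-order byte 0.

In some environments, a "byte-order mark" is used as the first
character. This is character 0xFEFF. (The byte-swapped 0xFFFE is not a
valid character, so if you see that, you know you've got the wrong byte
order.)

Good luck!

Tim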
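
P.S. A minimal sketch of the two byte-order guesses described above
(byte-order mark first, then "ASCII is common"). The names guess_swap()
and swap16() and the vote counting are made up for illustration; they
are not taken from the USB code, and a real driver might prefer to
trust whatever byte order the USB spec mandates.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not from the USB code: decide whether the
 * 16-bit values in 'w' look byte-swapped.
 * Returns 1 if they should be swapped, 0 if they look fine as-is.
 */
static int
guess_swap(const uint16_t *w, size_t n)
{
        size_t i;
        long votes = 0;

        if (n == 0)
                return (0);

        /* A byte-order mark in the first position settles it. */
        if (w[0] == 0xfeff)
                return (0);
        if (w[0] == 0xfffe)
                return (1);

        /* Otherwise assume ASCII is common and see which byte is zero. */
        for (i = 0; i < n; i++) {
                if (w[i] == 0)
                        continue;       /* tells us nothing */
                if ((w[i] & 0xff00) == 0)
                        votes++;        /* looks right as-is */
                else if ((w[i] & 0x00ff) == 0)
                        votes--;        /* looks byte-swapped */
        }
        return (votes < 0);
}

/* Swap the two bytes of a 16-bit value. */
static uint16_t
swap16(uint16_t v)
{
        return ((uint16_t)((v << 8) | (v >> 8)));
}
```

If guess_swap() returns 1, run each value through swap16() before
handing it to the conversion code. Strings with no byte-order mark and
no ASCII-range characters give no votes either way, and this sketch
then leaves them unswapped.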