From: Michael Gmelin
To: freebsd-ports@freebsd.org
Date: Sun, 30 Sep 2012 14:55:48 +0200
Subject: Re: Problems submitting patch containing UTF-8 characters
Message-ID: <20120930145548.59b03149@bsd64.grem.de>
In-Reply-To: <20120930050803.7914caf6@bsd64.grem.de>
References: <20120930050803.7914caf6@bsd64.grem.de>

On Sun, 30 Sep 2012 05:08:03 +0200
Michael Gmelin wrote:

> Hi,
>
> I recently ran into a problem submitting a PR containing UTF-8
> characters: they ended up garbled, so the maintainer couldn't apply
> the patch cleanly.
>
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645
>
> The characters included were 0xe4 0xb8 0xad and 0xe5 0x9b 0xbd (two
> three-byte characters). The code affected is about testing UTF-8, so
> the characters are required. And even if they were not, patching them
> away would require stating them as part of the patch.
>
> The original e-mail was created using porttools and therefore had no
> character set specification, which usually shouldn't be a problem. The
> patch was just inline as part of the body.
>
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=1
>
> The character sequence had been recoded to
> 0xc3 0xa4 0xc2 0xb8 0xc2 0xad 0xc3 0xa5 0xc2 0x9b 0xc2 0xbd
>
> It seems like it had been interpreted as latin1 on receipt and then
> reencoded as utf-8:
> 0xe4 => 0xc3 0xa4
> 0xb8 => 0xc2 0xb8
> 0xad => 0xc2 0xad
> 0xe5 => 0xc3 0xa5
> 0x9b => 0xc2 0x9b
> 0xbd => 0xc2 0xbd
>
> Which is obviously not what should happen. The recipient shouldn't
> make any assumptions about the character set used.
>
> The next attempt was sending the patch as a bug-followup through a
> graphical MUA. The patch was attached and had been encoded as
> quoted-printable (no specific charset specification):
>
> +-configPath =3D u"./config/=E4=B8=AD=E5=9B=BD_client.config"
> ++configPath =3D
> u"./config/=E4=B8=AD=E5=9B=BD_client.config".encode("utf-8=")
>
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=2
>
> Unfortunately the results are the same. I did not try forcing a
> charset by manually modifying the email (not sure if this will work;
> I'm willing to test, but I don't want to further litter that PR).
>
> At this point I figured that sending the patch in gzipped format
> might help. That done, the patch shows up as base64 in the PR.
> When I copy and paste the base64 text and decode it, the resulting .gz
> decompresses correctly and the content is what I expected. When
> clicking the download link, though:
>
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=3
>
> The resulting .gz file has the correct file size, but is corrupted.
> Checking it in a hex editor, it looks like it has been reencoded
> as utf-8 (and then truncated at the expected file size):
>
> Hex of the original file (first 16 bytes):
> 1f 8b 08 08 ad 79 65 50 00 03 70 79 32 37 2d 49
>
> Hex of the file downloaded by using the link:
> 1f c2 8b 08 08 c2 ad 79 65 50 00 03 70 79 32 37
>
> As you can see, all non-7-bit bytes have been utf-8 encoded, which
> is pretty suboptimal in a binary file.
>
> 0x8b => 0xc2 0x8b
> 0xad => 0xc2 0xad
> ...
>
> As a result the truncated and utf-8 encoded gzip file cannot be
> decompressed.
>
> I'm relatively certain that this has worked at some point in the past.
>
> Ideas anyone?
>
> Thanks,
>

By the way, the two three-byte sequences mean "China", see also
http://goo.gl/4muUF

-- 
Michael Gmelin
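
The suspected mechanism (each byte read as a Latin-1 character, then
written back out as UTF-8) can be checked against the byte sequences
quoted in this mail with a short Python 3 sketch. The helper names are
made up for illustration, and the truncation step is only an assumption
based on the observed file size; this reproduces the symptoms, it does
not show what the PR system actually does internally.

```python
# Sketch only: helper names are hypothetical, not taken from any PR tooling.

def hexdump(data: bytes) -> str:
    """Format bytes as space-separated lowercase hex pairs."""
    return " ".join(f"{b:02x}" for b in data)

def latin1_to_utf8(data: bytes) -> bytes:
    """Treat every input byte as a Latin-1 character and re-encode as UTF-8."""
    return data.decode("latin-1").encode("utf-8")

def undo_latin1_to_utf8(data: bytes) -> bytes:
    """Reverse latin1_to_utf8; only works if the data was not truncated."""
    return data.decode("utf-8").encode("latin-1")

# The two three-byte UTF-8 characters from the patch.
patch_bytes = bytes.fromhex("e4b8ade59bbd")
garbled_patch = latin1_to_utf8(patch_bytes)
print(hexdump(garbled_patch))

# First 16 bytes of the original .gz file, as listed in the mail.
gz_head = bytes.fromhex("1f8b0808ad7965500003707932372d49")
# Re-encode, then cut back to the original length (assumed truncation).
damaged_head = latin1_to_utf8(gz_head)[:len(gz_head)]
print(hexdump(damaged_head))

# The inline patch can be repaired because nothing was cut off.
print(undo_latin1_to_utf8(garbled_patch) == patch_bytes)
```

Under these assumptions the two hex dumps should match the recoded
sequences reported above. The inline patch is recoverable by reversing
the conversion, while the gzip download is not, since truncating to the
original size discards the trailing bytes.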