From owner-freebsd-ports@FreeBSD.ORG  Sun Sep 30 12:55:53 2012
Return-Path: <owner-freebsd-ports@FreeBSD.ORG>
Delivered-To: freebsd-ports@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 09257106564A
	for <freebsd-ports@freebsd.org>; Sun, 30 Sep 2012 12:55:53 +0000 (UTC)
	(envelope-from freebsd@grem.de)
Received: from mail.grem.de (outcast.grem.de [213.239.217.27])
	by mx1.freebsd.org (Postfix) with SMTP id 6D6208FC08
	for <freebsd-ports@freebsd.org>; Sun, 30 Sep 2012 12:55:51 +0000 (UTC)
Received: (qmail 22814 invoked by uid 89); 30 Sep 2012 12:55:49 -0000
Received: from unknown (HELO bsd64.grem.de) (mg@grem.de@80.137.96.220)
	by mail.grem.de with ESMTPA; 30 Sep 2012 12:55:49 -0000
Date: Sun, 30 Sep 2012 14:55:48 +0200
From: Michael Gmelin <freebsd@grem.de>
To: freebsd-ports@freebsd.org
Message-ID: <20120930145548.59b03149@bsd64.grem.de>
In-Reply-To: <20120930050803.7914caf6@bsd64.grem.de>
References: <20120930050803.7914caf6@bsd64.grem.de>
X-Mailer: Claws Mail 3.8.1 (GTK+ 2.24.6; amd64-portbld-freebsd9.0)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: Re: Problems submitting patch containing UTF-8 characters
X-BeenThere: freebsd-ports@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Porting software to FreeBSD <freebsd-ports.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-ports>,
	<mailto:freebsd-ports-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-ports>
List-Post: <mailto:freebsd-ports@freebsd.org>
List-Help: <mailto:freebsd-ports-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-ports>,
	<mailto:freebsd-ports-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 30 Sep 2012 12:55:53 -0000


On Sun, 30 Sep 2012 05:08:03 +0200
Michael Gmelin <freebsd@grem.de> wrote:

> Hi,
> 
> I recently ran into a problem submitting a PR containing UTF-8
> characters: they ended up garbled, so the maintainer couldn't apply
> the patch cleanly.
> 
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645
> 
> The characters included were 0xe4 0xb8 0xad and 0xe5 0x9b 0xbd (two
> three-byte characters). The affected code tests UTF-8 handling, so
> the characters are required. Even if they weren't, a patch removing
> them would still have to contain them.
> 
> The original e-mail was created using porttools and therefore had no
> character set specification, which usually shouldn't be a problem. The
> patch was just inline as part of the body.
> 
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=1
> 
> The character sequence had been recoded to
> 0xc3 0xa4 0xc2 0xb8 0xc2 0xad 0xc3 0xa5 0xc2 0x9b 0xc2 0xbd
> 
> It seems like it had been interpreted as Latin-1 on receipt and then
> re-encoded as UTF-8:
> 0xe4 => 0xc3 0xa4
> 0xb8 => 0xc2 0xb8
> 0xad => 0xc2 0xad
> 0xe5 => 0xc3 0xa5
> 0x9b => 0xc2 0x9b
> 0xbd => 0xc2 0xbd
> 
> This is obviously not what should happen. The recipient shouldn't
> make any assumptions about the character set used.
> 
> The next attempt was sending the patch as a bug-followup through a
> graphical MUA. The patch was attached and had been encoded as
> quoted-printable (no specific charset specification):
> 
> +-configPath =3D u"./config/=E4=B8=AD=E5=9B=BD_client.config"
> ++configPath =3D
> u"./config/=E4=B8=AD=E5=9B=BD_client.config".encode("utf-8=")
> 
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=2
> 
> Unfortunately the results are the same. I did not try forcing a
> charset by manually modifying the email (not sure if this will work,
> I'm willing to test, but I don't want to further litter that PR).
> 
> At this point I figured that sending the patch in gzipped format
> might help. So I did that, and the patch shows up as base64 in the
> PR. After copying, pasting, and decoding the base64 text, the
> resulting .gz can be decompressed correctly and the content is what
> I expected. When clicking the download link, though:
> 
> http://www.freebsd.org/cgi/query-pr.cgi?pr=171645&getpatch=3
> 
> The resulting .gz file has the correct file size, but is corrupted.
> Inspecting it in a hex editor, it looks like it has been re-encoded
> as UTF-8 (and then truncated at the expected file size):
> 
> Hex of the original file (first 16 bytes):
> 1f 8b 08 08 ad 79 65 50  00 03 70 79 32 37 2d 49
> 
> Hex of the file downloaded by using the link:
> 1f c2 8b 08 08 c2 ad 79  65 50 00 03 70 79 32 37
> 
> As you can see, all non-7-bit bytes have been UTF-8 encoded, which
> is pretty suboptimal in a binary file.
> 
> 0x8b => 0xc2 0x8b
> 0xad => 0xc2 0xad
> ...
> 
> As a result, the truncated and UTF-8 encoded gzip file cannot be
> decompressed.
> 
> I'm relatively certain that this has worked at some point in the past.
> 
> Ideas anyone?
> 
> Thanks,
> 
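
The byte patterns quoted above are consistent with every byte being
read as a Latin-1 code point and then written back out as UTF-8. The
following is a minimal Python sketch of that suspected mechanism. It
is an assumption based only on the hex dumps above, not a statement
about what the PR system actually does, and the helper names
(misdecode, repair, hexdump) are made up for illustration.

```python
# Sketch under assumptions: models the suspected Latin-1 -> UTF-8
# double encoding; helper names are hypothetical.

def misdecode(raw: bytes) -> bytes:
    """Read each byte as a Latin-1 code point, re-encode as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")

def repair(garbled: bytes) -> bytes:
    """Invert misdecode(); only possible if nothing was truncated."""
    return garbled.decode("utf-8").encode("latin-1")

def hexdump(data: bytes) -> str:
    return " ".join(f"{b:02x}" for b in data)

# Case 1: the two three-byte UTF-8 characters from the patch.
original = bytes.fromhex("e4b8ade59bbd")
garbled = misdecode(original)
print(hexdump(garbled))  # c3 a4 c2 b8 c2 ad c3 a5 c2 9b c2 bd

# Case 2: first 16 bytes of the gzip file, re-encoded and then cut
# back to the original length, as in the downloaded attachment.
gz_head = bytes.fromhex("1f8b0808ad79655000037079" "32372d49")
bad_head = misdecode(gz_head)[:len(gz_head)]
print(hexdump(bad_head))  # 1f c2 8b 08 08 c2 ad 79 65 50 00 03 70 79 32 37
```

If this is what happens, the inline and quoted-printable patches
could be recovered with repair(), since no bytes are lost there. The
gzip download could not, because the truncation discards the tail of
the re-encoded data.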

By the way, the two three-byte sequences together mean
"China"; see also http://goo.gl/4muUF

-- 
Michael Gmelin