Date:      Sun, 30 Sep 2012 23:20:11 GMT
From:      Michael Gmelin <freebsd@grem.de>
To:        freebsd-www@FreeBSD.org
Subject:   Re: www/172195: PR database corrupts patches
Message-ID:  <201209302320.q8UNKBaV062869@freefall.freebsd.org>

The following reply was made to PR www/172195; it has been noted by GNATS.

From: Michael Gmelin <freebsd@grem.de>
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: www/172195: PR database corrupts patches
Date: Mon, 1 Oct 2012 01:12:00 +0200

 Analysis:
 
 1. The PR system assumes a default encoding other than UTF-8. This
    means:
    a) Patches uploaded through the web form get corrupted
    b) Patches mailed as attachments without an explicit charset
       specification get corrupted
    c) Standard send-pr patches break. Adding a UTF-8 charset header
       manually will probably work, but is too easy to forget, and it
       won't fix the download option either.
 
 2. The PR system can handle binary attachments correctly in its base64
    view
 
 3. Downloaded patches are corrupted in all cases!
    a) File attached via webform:
       fetch -o - "http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=1" | hd
       c3 a4 c2 b8 c2 ad
       (should have been: e4 b8 ad e5 9b bd)
       This looks like the input has been assumed to be latin1,
        transcoded to UTF-8 and truncated.
 
    b) File sent as follow up attachment without UTF-8 charset:
        fetch -o - "http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=2" | hd
        c3 a4 c2 b8 c2 ad c3 a5
       (should have been: e4 b8 ad e5 9b bd)
       This looks like the input has been assumed to be latin1 and
       transcoded to UTF-8.
 
    c) File sent as follow up attachment WITH UTF-8 charset:
       (this one shows up correctly on the web page, the download is
        still broken though):
       fetch -o - "http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=3" | hd
       e4 b8 ad e5
       (should have been: e4 b8 ad e5 9b bd)
        This looks like it got the encoding right, but can't handle
        three-byte characters (a string length calculation problem?)
 
    d) Gzipped version of the patch:
       The base64 encoded version shown on the PR webpage is correct:
       md5 china.txt.gz
       MD5 (china.txt.gz) = 29009c79690c58b0762274da0e3ad80d
  
       echo "H4sICIG7aFAAA2NoaW5hLnR4dAB7smPt09l7uQC1SPS1BwAAAA==" \
       | openssl enc -d -a | md5
       29009c79690c58b0762274da0e3ad80d
 
       Downloading through the download link fails though:
       fetch -o - "http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=4" | md5
       ae9f2f3531871be8c4af662863eb542e
 
        A closer look at the gzip file shows that there has been an
        attempt to somehow UTF-8 encode the binary content:
 
       Original:
 00000000  1f 8b 08 08 81 bb 68 50  00 03 63 68 69 6e 61 2e
 00000010  74 78 74 00 7b b2 63 ed  d3 d9 7b b9 00 b5 48 f4
 00000020  b5 07 00 00 00
 00000025
 
       File as downloaded from the PR website:
 00000000  1f c2 8b 08 08 c2 81 c2  bb 68 50 00 03 63 68 69
 00000010  6e 61 2e 74 78 74 00 7b  c2 b2 63 c3 ad c3 93 c3
 00000020  99 7b c2 b9 00
 00000025
 
        As you can see, every byte with the high bit set has been UTF-8
        encoded, and the resulting file got truncated at the original
        file size.
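 
 The latin1 hypothesis can be checked offline. The following Python
 sketch is only a model of the suspected behavior, not the PR system's
 actual code; the function name suspected_corruption is made up for
 illustration. It decodes the input as latin1, re-encodes it as UTF-8,
 cuts the result back to the original byte count, and compares the
 outcome with the dumps quoted above.

```python
import base64


def suspected_corruption(data: bytes) -> bytes:
    """Hypothetical model of the bug: latin1 -> UTF-8, then truncate."""
    transcoded = data.decode("latin-1").encode("utf-8")
    # Cut back to the original number of bytes.
    return transcoded[: len(data)]


# Case a): the six bytes listed as "should have been".
expected_text = bytes.fromhex("e4 b8 ad e5 9b bd")
print(suspected_corruption(expected_text).hex(" "))  # c3 a4 c2 b8 c2 ad

# Case d): the base64 text from the PR page and the dump of the download.
original_gz = base64.b64decode(
    "H4sICIG7aFAAA2NoaW5hLnR4dAB7smPt09l7uQC1SPS1BwAAAA=="
)
downloaded_gz = bytes.fromhex(
    "1f c2 8b 08 08 c2 81 c2 bb 68 50 00 03 63 68 69"
    "6e 61 2e 74 78 74 00 7b c2 b2 63 c3 ad c3 93 c3"
    "99 7b c2 b9 00"
)
print(suspected_corruption(original_gz) == downloaded_gz)  # True
```

 This model reproduces cases a) and d). Cases b) and c) are cut at a
 different length and are not covered by it.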
 
 
 Conclusion:
 
 There is no simple way of submitting a patch through the PR system so
 that it can be downloaded using the download link. Right now the
 options are:
 
 1. Send the file as an email attachment, making sure that the character
    encoding in the MIME header is set to UTF-8 (not all email clients
    will do this automatically). This way a patch can be acquired from
    the web page by copy and paste. The download link will still not
    work correctly, though, and yields surprising results. A patch
    acquired this way might actually apply, but cause unintended
    behavior.
 
 2. Send the file gzipped and have people base64 decode the view on the
    PR page to get the gzip file. This way, when the download link is
    used, people will at least realize that something went wrong.
 
 3. Base64 encode the patch before sending it. This way everything
    stays us-ascii and cannot be messed with by the PR system. It
    requires users to base64 decode on their own, however, and makes it
    hard to discuss the patch in a way that's transparent to users of
    the web page.
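 
 For options 2 and 3, the decoding step on the receiving side could
 look roughly like the Python sketch below. It is an illustration
 only: the helper names recover_patch and encode_patch_for_mail are
 invented here, and the base64 text and MD5 value are the ones quoted
 in the analysis above.

```python
import base64
import gzip
import hashlib
from typing import Tuple

# Base64 text as displayed on the PR page (quoted in the analysis).
PAGE_BASE64 = "H4sICIG7aFAAA2NoaW5hLnR4dAB7smPt09l7uQC1SPS1BwAAAA=="
# MD5 of china.txt.gz as reported in the analysis.
REPORTED_MD5 = "29009c79690c58b0762274da0e3ad80d"


def recover_patch(b64_text: str) -> Tuple[bytes, str]:
    """Option 2: return (patch bytes, MD5 hex digest of the gzip file)."""
    gz_bytes = base64.b64decode(b64_text)
    return gzip.decompress(gz_bytes), hashlib.md5(gz_bytes).hexdigest()


def encode_patch_for_mail(patch_bytes: bytes) -> str:
    """Option 3: produce a pure us-ascii representation of the patch."""
    return base64.encodebytes(patch_bytes).decode("ascii")


patch, digest = recover_patch(PAGE_BASE64)
print("digest matches reported MD5:", digest == REPORTED_MD5)
print("patch size in bytes:", len(patch))

# Round trip for option 3: the receiver gets the same bytes back.
assert base64.b64decode(encode_patch_for_mail(patch)) == patch
```

 Either way the receiver works on bytes and never passes the data
 through a text decoding step, which is what keeps it intact.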
 
 None of these options seems very appealing, especially since they make
 it easy for people to get it wrong and hard to get it right. Also,
 various tools used by port maintainers (porttools, send-pr etc.) might
 not be prepared to help the user get it right. There will be more and
 more UTF-8 encoded patches in the future, so I think this should be
 fixed.
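 
 To illustrate what "getting it right" means for option 1, here is a
 minimal Python sketch of a follow-up mail whose attachment carries an
 explicit UTF-8 charset. This is not what send-pr or porttools do; the
 function build_followup, the file name and the body text are
 assumptions made for the example.

```python
from email.message import EmailMessage


def build_followup(patch_text: str, subject: str, filename: str) -> EmailMessage:
    """Hypothetical helper: attach a patch with charset=utf-8 declared."""
    msg = EmailMessage()
    msg["To"] = "bug-followup@FreeBSD.org"
    msg["Subject"] = subject
    msg.set_content("Patch attached.\n")
    # Passing a str plus charset makes the MIME part declare
    # Content-Type: text/plain; charset="utf-8".
    msg.add_attachment(
        patch_text, subtype="plain", charset="utf-8", filename=filename
    )
    return msg


# "\u4e2d\u56fd" is e4 b8 ad e5 9b bd in UTF-8, the test content used above.
mail = build_followup(
    "\u4e2d\u56fd\n",
    "Re: www/172195: PR database corrupts patches",
    "china.txt",
)
attachment = next(mail.iter_attachments())
print(attachment["Content-Type"])
```

 Even with the charset declared, this only addresses the web view; as
 noted above, the download link is still affected.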
 
 Suggested fixes:
 
 - Change the default encoding (the encoding assumed when none is
   specified) to UTF-8. This might not be practical in all cases, but
   should be discussed.
 - Make sure that the download option provides correct files (it should
   treat all files as binary and not try to alter them in any way).
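 
 As a rough illustration of the second point, a download handler that
 treats the attachment as opaque bytes could be sketched in Python as
 follows. This says nothing about how query-pr.cgi is actually
 written; build_download_response and the chosen headers are
 assumptions for the example.

```python
def build_download_response(payload: bytes, filename: str) -> bytes:
    """Hypothetical CGI-style response that passes the payload through."""
    headers = [
        "Content-Type: application/octet-stream",
        f'Content-Disposition: attachment; filename="{filename}"',
        # The length is counted in bytes, not in characters.
        f"Content-Length: {len(payload)}",
    ]
    head = ("\r\n".join(headers) + "\r\n\r\n").encode("ascii")
    # No decoding or re-encoding: the body is the stored bytes, unchanged.
    return head + payload


stored = bytes.fromhex("e4 b8 ad e5 9b bd")
response = build_download_response(stored, "china.txt")
head, _, body = response.partition(b"\r\n\r\n")
print(body == stored)  # True
```

 The point of the sketch is that the payload never goes through a
 string type, so neither transcoding nor a character-based length can
 affect it.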
 
 I hope all of this makes sense.
 
 -- 
 Michael Gmelin


