From owner-freebsd-doc@FreeBSD.ORG Sun May 13 15:20:10 2007
Date: Sun, 13 May 2007 15:20:09 GMT
Message-Id: <200705131520.l4DFK9nS000569@freefall.freebsd.org>
To: freebsd-doc@FreeBSD.org
From: Jeroen Ruigrok van der Werven
Subject: Re: docs/50211: [PATCH] doc.docbook.mk: fix textfile creation
Reply-To: Jeroen Ruigrok van der Werven
List-Id: Documentation project

The following reply was made to PR docs/50211; it has been noted by GNATS.

From: Jeroen Ruigrok van der Werven
To: bug-followup@FreeBSD.org
Subject: Re: docs/50211: [PATCH] doc.docbook.mk: fix textfile creation
Date: Sun, 13 May 2007 16:59:23 +0200

A long overdue update, I guess.

Neither links nor elinks will help in the multibyte environments of Chinese, Japanese, Korean, and the like. They simply do not understand encodings such as EUC-JP, SJIS, GB18030, GB2312, EUC-KR, or UTF-8. Using www/w3m-m17n I can at least view Japanese pages.
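To make the encoding problem concrete, here is a minimal Python sketch. It is not part of the doc build, and the sample string and the helper name to_utf8 are made up for illustration. It shows that the same Japanese text is a different byte sequence in EUC-JP, SJIS, and UTF-8, so a dumper has to know the source charset before it can write readable UTF-8:

```python
# Illustrative only: SAMPLE and to_utf8() are hypothetical names,
# not anything used by doc.docbook.mk or by w3m itself.
SAMPLE = "日本語の文書"  # "Japanese document"


def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Decode bytes in a known source charset and re-encode them as UTF-8."""
    return raw.decode(source_charset).encode("utf-8")


if __name__ == "__main__":
    for charset in ("euc_jp", "shift_jis", "utf-8"):
        raw = SAMPLE.encode(charset)
        # Same text, but the byte sequence depends on the charset.
        print(f"{charset:10} {len(raw):2d} bytes  {raw[:6].hex()}")
        # A charset-aware conversion always yields the same UTF-8 bytes.
        assert to_utf8(raw, charset) == SAMPLE.encode("utf-8")

    # Treating the EUC-JP bytes as a single-byte charset garbles the text.
    garbled = SAMPLE.encode("euc_jp").decode("latin-1")
    print("garbled differs from original:", garbled != SAMPLE)
```

A tool that skips the decode step, or guesses a single-byte charset, ends up with output like the garbled string in the last step.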
Dumping an EUC-JP encoded page with 'w3m -dump http://website > dump.txt' produces a UTF-8 encoded plain text file. The same also works for (X-)SJIS (Japanese), GB2312 (Chinese/PRC), EUC-KR (Korean), UTF-8, TIS-620 (Thai), Big5 (Taiwanese), VISCII (Vietnamese), and KOI8-U (Russian). I also tried some ISO-8859 dumps (8859-6 and 8859-7, for example) and they all work fine.

So my suggestion is to change HTML2TXT to use w3m from w3m-m17n.

-- 
Jeroen Ruigrok van der Werven / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/
Reality is an illusion, grimmer. The dreamlands are like masks within masks, and Time has no dominion beyond the Shroud...