Date:      Thu, 15 Nov 2001 21:52:44 +0200
From:      Alexey Zelkin <phantom@FreeBSD.ORG>
To:        Hiroki Sato <hrs@eos.ocn.ne.jp>
Cc:        horcicka@FreeBSD.cz, freebsd-doc@FreeBSD.ORG, nik@FreeBSD.ORG, saken@hotel.rmta.org
Subject:   Re: Why TIDY can never work correctly with ISO-8859-2 and others
Message-ID:  <20011115215244.A7285@ark.cris.net>
In-Reply-To: <20011115160532.A61351@ark.cris.net>; from phantom@FreeBSD.ORG on Thu, Nov 15, 2001 at 04:05:32PM +0200
References:  <20011115105650.W57038-100000@dual.ms.mff.cuni.cz> <20011115.214017.71143189.hrs@sekine00.ee.noda.sut.ac.jp> <20011115160532.A61351@ark.cris.net>


--OgqxwSJOaUobr8KG
Content-Type: text/plain; charset=us-ascii

[ Cc'ed to the maintainer of ports/www/tidy ]

Hi,

The attached patch does the job; at least, my simple tests passed.
I added a new option, '-preserve', to tidy (also available as
'preserve-entities' in the configuration file). It disables the
translation of character entities to characters before processing.
As a side effect, all entities are written to the output file unchanged.
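
To illustrate the idea outside of tidy's sources, here is a minimal
sketch in C. It is not the patch itself and not tidy's lexer: the names
preserve_entities and copy_text and the three-entry entity table are
hypothetical and exist only for this example. It shows the same decision
the patch makes: expand an entity on '&' only when the flag is off.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the PreserveEntities flag added by the patch. */
int preserve_entities = 0;

/* Tiny illustrative entity table (assumption: real tidy's is far larger). */
struct entity {
    const char *name;
    unsigned char code;
};

static const struct entity entities[] = {
    { "nbsp", 160 },
    { "copy", 169 },
    { "amp",  38  },
};

/*
 * Copy src into dst, writing at most dstsize - 1 bytes plus a NUL.
 * With preserve_entities == 0, each known "&name;" reference is replaced
 * by a single byte holding its character code. With the flag set, the
 * text is copied verbatim, so entities survive untouched.
 * Returns the number of bytes written, not counting the NUL.
 */
size_t copy_text(const char *src, char *dst, size_t dstsize)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL && dstsize > 0);

    while (*src != '\0' && n + 1 < dstsize) {
        if (*src == '&' && !preserve_entities) {
            size_t i;
            int matched = 0;

            for (i = 0; i < sizeof entities / sizeof entities[0]; i++) {
                size_t len = strlen(entities[i].name);

                /* Match "&name;" exactly, including the semicolon. */
                if (strncmp(src + 1, entities[i].name, len) == 0 &&
                    src[1 + len] == ';') {
                    dst[n++] = (char)entities[i].code;
                    src += len + 2;   /* skip '&', the name and ';' */
                    matched = 1;
                    break;
                }
            }
            if (matched)
                continue;
        }
        /* Ordinary character, unknown entity, or preserve mode. */
        dst[n++] = *src++;
    }
    dst[n] = '\0';
    return n;
}
```

With the flag off, "a&nbsp;b" becomes three bytes with code 160 in the
middle, which is exactly what hurts in non-Latin-1 8-bit encodings. With
the flag on, the same input comes out byte for byte as it went in.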

I would like feedback on this patch. It should do a good job at least
for the Russian Doc Project, and I'd like to see it committed.

On Thu, Nov 15, 2001 at 04:05:32PM +0200, Alexey Zelkin wrote:
> hi,
> 
> On Thu, Nov 15, 2001 at 09:40:17PM +0900, Hiroki Sato wrote:
> 
> > horcicka> And if you use char-encoding: raw - character entities with values above 255
> > horcicka> are not printed as entities - this is really bad in 8-bit encodings.
> > 
> >  Yes, Japanese docs also suffer from it.  The input routine of tidy expands
> >  any entities first, even if the -raw flag is specified.
> > 
> > horcicka> In my opinion Tidy cannot be used for encodings it does not natively support
> > horcicka> (i.e. for Russian and Czech (- still not in main CVS) translations of pages
> > horcicka> and docs).
> > 
> >  I think so, too.
> > 
> >  As a workaround, we can apply a patch and use a modified
> >  version of tidy that does not expand entities in the input,
> >  but I do not know whether that would be a good solution.
> 
> The most noticeable problem in the -raw case is that &nbsp; is
> converted to the character with code 160. As a good-enough
> workaround for the Russian translation we have used -latin1, but
> expanding all entities except &nbsp; and &amp; is bad anyway.
> 
> I am working on a patch for tidy(1) that adds a new option to
> suppress all entity -> character recoding. I hope that will be enough.

--OgqxwSJOaUobr8KG
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="tidy.preserve.entities.patch"

diff -u work/tidy4aug00/config.c tidy4aug00.patched/config.c
--- work/tidy4aug00/config.c	Fri Aug  4 19:21:05 2000
+++ tidy4aug00.patched/config.c	Thu Nov 15 21:55:25 2001
@@ -94,6 +94,7 @@
 Bool TidyMark = yes;        /* add meta element indicating tidied doc */
 Bool Emacs = no;            /* if true format error output for GNU Emacs */
 Bool LiteralAttribs = no;   /* if true attributes may use newlines */
+Bool PreserveEntities = no; /* if true don't convert entities to chars */
 
 typedef struct _lex PLex;
 
@@ -186,6 +187,7 @@
     {"doctype",         {(int *)&doctype_str},      ParseDocType},
     {"fix-backslash",   {(int *)&FixBackslash},     ParseBool},
     {"gnu-emacs",       {(int *)&Emacs},            ParseBool},
+    {"preserve-entities", {(int *)&PreserveEntities}, ParseBool},
 
   /* this must be the final entry */
     {0,          0,             0}
@@ -533,6 +535,12 @@
     {
         QuoteAmpersand = yes;
         HideEndTags = no;
+    }
+
+ /* Avoid &amp;copy; in preserve-entities case */
+    if (PreserveEntities)
+    {
+       QuoteAmpersand = no;
     }
 }
 
diff -u work/tidy4aug00/html.h tidy4aug00.patched/html.h
--- work/tidy4aug00/html.h	Fri Aug  4 19:21:05 2000
+++ tidy4aug00.patched/html.h	Thu Nov 15 21:55:26 2001
@@ -758,6 +758,7 @@
 extern Bool Word2000;
 extern Bool Emacs;  /* sasdjb 01May00 GNU Emacs error output format */
 extern Bool LiteralAttribs;
+extern Bool PreserveEntities;
 
 /* Parser methods for tags */
 
diff -u work/tidy4aug00/lexer.c tidy4aug00.patched/lexer.c
--- work/tidy4aug00/lexer.c	Fri Aug  4 19:21:05 2000
+++ tidy4aug00.patched/lexer.c	Thu Nov 15 21:55:26 2001
@@ -1517,8 +1517,10 @@
 
                     continue;
                 }
-                else if (c == '&' && mode != IgnoreMarkup)
-                    ParseEntity(lexer, mode);
+                else if (c == '&' && mode != IgnoreMarkup
+				&& !PreserveEntities) {
+               		ParseEntity(lexer, mode);
+		}
 
                 /* this is needed to avoid trimming trailing whitespace */
                 if (mode == IgnoreWhitespace)
@@ -2624,7 +2626,7 @@
                 seen_gt = yes;
         }
 
-        if (c == '&')
+        if (c == '&')	/* XXX: possibly need support for PreserveEntities */
         {
             AddCharToLexer(lexer, c);
             ParseEntity(lexer, null);
diff -u work/tidy4aug00/localize.c tidy4aug00.patched/localize.c
--- work/tidy4aug00/localize.c	Fri Aug  4 19:21:05 2000
+++ tidy4aug00.patched/localize.c	Thu Nov 15 21:55:26 2001
@@ -736,6 +736,7 @@
     tidy_out(out, "  -xml            use this when input is wellformed xml\n");
     tidy_out(out, "  -asxml          to convert html to wellformed xml\n");
     tidy_out(out, "  -slides         to burst into slides on h2 elements\n");
+    tidy_out(out, "  -preserve       to preserve entities as is in source file\n");
     tidy_out(out, "\n");
 
     tidy_out(out, "Character encodings\n");
diff -u work/tidy4aug00/man_page.txt tidy4aug00.patched/man_page.txt
--- work/tidy4aug00/man_page.txt	Fri Aug  4 19:21:05 2000
+++ tidy4aug00.patched/man_page.txt	Thu Nov 15 21:55:26 2001
@@ -12,6 +12,7 @@
 .IR column ]
 .RB [ -upper ]
 .RB [ -clean ]
+.RB [ -preserve ]
 .RB [ -raw
 |
 .B -ascii
@@ -106,6 +107,9 @@
 .TP
 .B -slides
 Burst into slides on <H2> elements.
+.TP
+.B -preserve
+Preserve source file entities as is.
 .TP
 .BR -help ", " -h
 List command-line options.
diff -u work/tidy4aug00/tidy.c tidy4aug00.patched/tidy.c
--- work/tidy4aug00/tidy.c	Fri Aug  4 19:21:05 2000
+++ tidy4aug00.patched/tidy.c	Thu Nov 15 21:55:26 2001
@@ -785,6 +785,8 @@
                 Quiet = yes;
             else if (strcmp(arg, "slides") == 0)
                 BurstSlides = yes;
+            else if (strcmp(arg, "preserve") == 0)
+                PreserveEntities = yes;
             else if (strcmp(arg, "help") == 0 ||
                      argv[1][1] == '?'|| argv[1][1] == 'h')
             {

--OgqxwSJOaUobr8KG--




