Date:      Mon, 6 Mar 2000 02:14:55 +0000
From:      Nik Clayton <nik@freebsd.org>
To:        "Andrey A. Chernov" <ache@nagual.pp.ru>
Cc:        doc@freebsd.org, www@freebsd.org, phantom@freebsd.org, ru@freebsd.org
Subject:   Re: SGML->HTML: entities translation is broken for non-Latin1 charsets
Message-ID:  <20000306021454.A87062@catkin.nothing-going-on.org>
In-Reply-To: <20000304134300.A24194@nagual.pp.ru>; from Andrey A. Chernov on Sat, Mar 04, 2000 at 01:43:02PM +0300
References:  <20000304134300.A24194@nagual.pp.ru>

On Sat, Mar 04, 2000 at 01:43:02PM +0300, Andrey A. Chernov wrote:
> Looking at www.freebsd.org I found that sgml->html procedure replace
> things like &nbsp; &copy; etc. with their Latin1 8bit hardcoded values :-(

This is done by sgmlnorm.  Last time this issue came up I didn't have a 
good fix for it either. . .

Back then I spoke to the OpenJade maintainers, and got a reply from
Matthias Clasen <clasen@pong.mathematik.uni-freiburg.de>, who said:

> sgmlnorm is not designed to do what you request.

which might be true, but doesn't really help us.

I've done some more digging.  I don't have the necessary skills to fix
this myself, but perhaps the following will point someone who does in
the right direction.

First off, sgmlnorm is part of Jade, and it's written in C++, which 
complicates things mightily.  I'm no C++ programmer, so I'm extrapolating
from my C and Perl knowledge here. . .

If you look in jade/style/sdata.h, you'll see a list of initializers that
pair character codes with entity names.  This is the root cause of the
problem, and a typical line from that file is 

    { 0x00A9, "copy" },

which is why "&copy;" becomes "\a9" when a file is processed by sgmlnorm.

This file is used in jade/style/Interpreter.cxx to build an array of
structs, in this piece of code:

--
void Interpreter::installSdata()
{
  // This comes from uni2sgml.txt on ftp://unicode.org.
  // It is marked there as obsolete, so it probably ought to be checked.
  // The definitions of apos and quot have been fixed for consistency with XML.
  static struct {
    Char c;
    const char *name;
  } entities[] = {
#include "sdata.h"
  };
  for (size_t i = 0; i < SIZEOF(entities); i++)
    sdataEntityNameTable_.insert(makeStringC(entities[i].name), entities[i].c);
}
--

I assume that's building a lookup table, to map entity names to their
corresponding character codes.
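
To make that concrete, here is a minimal standalone sketch of the same
pattern.  It is not Jade's code: std::map stands in for whatever table
class Jade really uses, the Char typedef and the function name are my own
inventions, and only the "copy" row is taken from sdata.h as quoted above
(the other two rows are the usual Latin-1 values, added for illustration).

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for Jade's Char type.
typedef unsigned int Char;

// Same shape as the anonymous struct in installSdata().
struct SdataEntry {
  Char c;
  const char *name;
};

// Rows in the same form as sdata.h.  Only "copy" is quoted in this
// mail; "nbsp" and "reg" are assumed examples.
static const SdataEntry entries[] = {
  { 0x00A0, "nbsp" },
  { 0x00A9, "copy" },
  { 0x00AE, "reg" },
};

// Build a name -> character code lookup table from the array.
std::map<std::string, Char> buildSdataTable()
{
  std::map<std::string, Char> table;
  for (std::size_t i = 0; i < sizeof(entries) / sizeof(entries[0]); i++)
    table[entries[i].name] = entries[i].c;
  return table;
}
```

The point to notice is that the value side of the table is a single
character code, so the entity name is gone once the lookup has happened.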

The only other place sdataEntityNameTable is used is in the
Interpreter::sdataMap method.  That function is passed the entity name,
and a reference to a character to output, and alters the reference as
necessary, based upon the sdataEntityNameTable map.

The logic seems to be:

  1.  If the entity name is in sdataEntityNameTable then lookup its 
      replacement (e.g., "\a9") and return.

  2.  If it's not there, call convertUnicodeCharName() on it.  This is
      also defined in Interpreter.cxx, and is a simple switch().

  3.  If that step failed, return defaultChar, which seems to be 0xfffd.

Most of the time, step (1) is going to succeed.
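
The three steps above can be sketched roughly as follows.  This is my
reading of the logic, not the real Interpreter::sdataMap: the function
names, the std::map table and the Char typedef are hypothetical, and the
"U-XXXX" name format handled by the step 2 stand-in is an assumption made
purely so the example has something to do.

```cpp
#include <cstddef>
#include <map>
#include <string>

typedef unsigned int Char;         // hypothetical stand-in for Jade's Char
const Char defaultChar = 0xfffd;   // the fallback value from step 3

// Stand-in for convertUnicodeCharName().  Assumes names look like
// "U-00E9" (U, dash, four uppercase hex digits).
bool convertUnicodeCharNameSketch(const std::string &name, Char &c)
{
  if (name.size() != 6 || name[0] != 'U' || name[1] != '-')
    return false;
  Char value = 0;
  for (std::size_t i = 2; i < 6; i++) {
    char ch = name[i];
    value <<= 4;
    if (ch >= '0' && ch <= '9')
      value |= Char(ch - '0');
    else if (ch >= 'A' && ch <= 'F')
      value |= Char(ch - 'A' + 10);
    else
      return false;
  }
  c = value;
  return true;
}

// Takes the entity name and a reference to the output character,
// and sets the reference according to the three steps.
void sdataMapSketch(const std::map<std::string, Char> &table,
                    const std::string &name, Char &c)
{
  std::map<std::string, Char>::const_iterator it = table.find(name);
  if (it != table.end()) {          // step 1: name is in the table
    c = it->second;
    return;
  }
  if (convertUnicodeCharNameSketch(name, c))  // step 2: parse the name
    return;
  c = defaultChar;                  // step 3: give up
}
```

Whichever step succeeds, the caller only ever gets one character back.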

As you can see, this code is designed to convert entity names to their
numeric references (actually, to C++ chars), and a quick glance at the
surrounding and calling code shows that the assumption that the reference
passed to sdataMap is a single character is deeply embedded.  Changing it
will probably touch quite a lot of code.

Working backwards, the single character (Char c_) is defined in the
SdataNode class (a subclass of EntityRefNode) in spgrove/GroveBuilder.cxx.
The single character is private to the class, and can only be accessed
through the SdataNode::charChunk method.  A quick grep through the 
source tree shows lots of calls to charChunk() :-(
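
To illustrate why that hurts, here is a deliberately simplified sketch.
Neither class is Jade's: the class names, the charChunk signature used
here and the std::vector storage are all hypothetical.  The first class
stores one character, as SdataNode appears to; the second shows the kind
of storage that would be needed to hand back a whole entity reference
such as "&copy;" instead.

```cpp
#include <cstddef>
#include <string>
#include <vector>

typedef unsigned int Char;   // hypothetical stand-in for Jade's Char

// A node that can only ever describe one character.
class SingleCharNode {
public:
  explicit SingleCharNode(Char c) : c_(c) {}
  // Hands back a pointer and a length; the length is always 1.
  void charChunk(const Char *&data, std::size_t &size) const
  {
    data = &c_;
    size = 1;
  }
private:
  Char c_;
};

// A hypothetical alternative that stores a run of characters, so the
// replacement could be the text of an entity reference.
class StringNode {
public:
  explicit StringNode(const std::string &ascii)
    : text_(ascii.begin(), ascii.end()) {}
  void charChunk(const Char *&data, std::size_t &size) const
  {
    data = text_.empty() ? 0 : &text_[0];
    size = text_.size();
  }
private:
  std::vector<Char> text_;
};
```

Any caller that quietly assumes a length of 1 would have to be found and
checked before something like the second class could be swapped in.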

After that, I get a bit lost.  I haven't got the tools here to hold a 
full class hierarchy in my head. . .

But that's a start, if anyone wants to do some digging.

N
-- 
Internet connection, $19.95 a month.  Computer, $799.95.  Modem, $149.95.
Telephone line, $24.95 a month.  Software, free.  USENET transmission,
hundreds if not thousands of dollars.  Thinking before posting, priceless.
Somethings in life you can't buy.  For everything else, there's MasterCard.
  -- Graham Reed, in the Scary Devil Monastery

