Date:      Fri, 28 May 2010 23:45:58 +0300
From:      Nikos Vassiliadis <nvass9573@gmx.com>
To:        Polytropon <freebsd@edvax.de>
Cc:        Gary Kline <kline@thought.org>, FreeBSD Mailing List <freebsd-questions@freebsd.org>
Subject:   Re: any shortcuts to doc to ascii?
Message-ID:  <4C002B86.5090007@gmx.com>
In-Reply-To: <20100528090057.87144ef4.freebsd@edvax.de>
References:  <20100527013843.GA40751@thought.org>	<20100527050302.da39c258.freebsd@edvax.de>	<20100527233607.GD19297@thought.org> <20100528090057.87144ef4.freebsd@edvax.de>

Polytropon wrote:
> On Thu, 27 May 2010 16:36:08 -0700, Gary Kline <kline@thought.org> wrote:
>> 	i don't see any ascii suffix [for OOo].  i saved as .txt.
> 
> This should be right. The .txt extension refers to ASCII text,
> at least in standard-compliant operating systems.
> 
> 
> 
>> 	same krap.  the \x94, x9d, \x9c...  same with catdoc.  i'll
>> 	try antiword.  [forgot about that.  ]
> 
> This makes me believe that the original DOC file has been created
> with a wrong character set or language setting. "Windows" - as far
> as I know - does not use standard locales such as all other systems
> do, but uses an arbitrary setting.
> 

The byte sequence is valid UTF-8 encoded text:
[nik@moby ~]$ python -c 'print "Don%c%c%ct" % (0xe2, 0x80, 0x99)' | file -
/dev/stdin: UTF-8 Unicode text
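
To see which character those three bytes stand for, they can be decoded
directly. This is only a minimal sketch in Python 3 syntax (the one-liner
above is Python 2); the variable names are mine, not from the thread.

```python
import unicodedata

# The three bytes used in the one-liner above.
raw = bytes([0xE2, 0x80, 0x99])

# Decoding them as UTF-8 yields a single code point.
char = raw.decode("utf-8")

print(len(char))               # number of code points after decoding
print(hex(ord(char)))          # numeric value of the code point
print(unicodedata.name(char))  # official Unicode character name
```

The code point it prints is the one described at the fileformat.info
link further down.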

You'll be able to see the character if you fire up a UTF-8 capable 
terminal with proper locale settings.
[nik@moby ~]$ LC_ALL=en_US.UTF-8 xterm -u8

After that, just print the char:
python -c 'print "Don%c%c%ct" % (0xe2, 0x80, 0x99)'
and use copy & paste to pass it to tr to translate it to something else, 
for example:
tr '’' "'" < "$file" > "$output"
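
If copy & paste into the shell is awkward, or if your tr works on bytes
rather than characters, the same replacement can be sketched in Python.
This assumes the input file really is UTF-8. The function names and the
extra quote characters in the table are my own additions for
illustration, not something from the original document.

```python
# Map typographic quotes to their plain ASCII counterparts.
# Only U+2019 comes from this thread; the other three are assumed extras.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (the one in "Don't")
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def asciify_quotes(text: str) -> str:
    """Return text with curly quotes replaced by ASCII quotes."""
    return text.translate(QUOTE_MAP)


def convert_file(src: str, dst: str) -> None:
    """Read src as UTF-8 and write the converted text to dst."""
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(asciify_quotes(line))


print(asciify_quotes("Don\u2019t"))
```

Calling convert_file(file, output) would then play the role of the tr
command above.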

> Another idea may be that the character that you think should be
> an apostrophe isn't an apostrophe. I often do see this in German
> texts with misplaced apostrophes that are in fact accent grave
> or accent acute, or a character from UTF-8 that just looks like
> an apostrophe. For example, if the original document contains
> 
> 	We don`t
> 
> and this ` is not a real ', then conversion tools will of course
> use the "escape notation" for this unknown character.

Indeed, the standard tool for encoding conversions, iconv, chokes on
this. Yet it worked when I tried to convert from UTF-8 to a Greek
encoding ('iconv -f utf-8 -t iso-8859-7'). Some info on the char:
http://www.fileformat.info/info/unicode/char/2019/index.htm
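
A rough way to check which target encodings can represent the character
at all, without going through iconv. This sketch assumes the failing
conversion was to plain ASCII, which I did not state above; the helper
name can_encode is made up for the example, and Python's codec tables
may not match your iconv in every detail.

```python
char = "\u2019"  # the character discussed above


def can_encode(text: str, encoding: str) -> bool:
    """Return True if text can be represented in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Compare a target without the character against the Greek encoding.
for enc in ("ascii", "iso8859_7"):
    print(enc, can_encode(char, enc))
```

When the target has no slot for the character, a converter has to fail,
drop it, or substitute something, which fits the behaviour described
above.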

HTH, Nikos


