Date:      Tue, 3 Jul 2018 23:10:02 +0200
From:      Jilles Tjoelker <jilles@stack.nl>
To:        Hiroki Sato <hrs@FreeBSD.org>
Cc:        daichigoto@icloud.com, lists@eitanadler.com, daichi@freebsd.org, gnn@FreeBSD.org, cem@freebsd.org, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r335836 - head/usr.bin/top
Message-ID:  <20180703211002.GA11832@stack.nl>
In-Reply-To: <20180703.020956.859981414196673670.hrs@allbsd.org>
References:  <CAF6rxg=Zjkf6EbSgt1fBQBUDHGKWwLf=n9ZJweJH%2BDi800kJ3w@mail.gmail.com> <20180702.155529.1102410939281120947.hrs@allbsd.org> <459BD898-8072-426E-A968-96C1382AC616@icloud.com> <20180703.020956.859981414196673670.hrs@allbsd.org>

On Tue, Jul 03, 2018 at 02:09:56AM +0900, Hiroki Sato wrote:
> 後藤大地 <daichigoto@icloud.com> wrote
>   in <459BD898-8072-426E-A968-96C1382AC616@icloud.com>:
> da> > On 2018/07/02 15:55, Hiroki Sato <hrs@FreeBSD.org> wrote:

> da> > Eitan Adler <lists@eitanadler.com> wrote
> da> >  in <CAF6rxg=Zjkf6EbSgt1fBQBUDHGKWwLf=n9ZJweJH+Di800kJ3w@mail.gmail.com>:

> da> > li> On 1 July 2018 at 10:08, Conrad Meyer <cem@freebsd.org> wrote:

> da> > li> > I don't think code to decode UTF-8 belongs in top(1).  I don't know
> da> > li> > what the goal of this routine is, but I doubt this is the right way to
> da> > li> > accomplish it.
> da> > li>
> da> > li> For the record, I agree. This is why I didn't click "accept" on the
> da> > li> revision. I don't fully oppose leaving it in top(1) for now as we work
> da> > li> out the API, but long term it's the wrong place.
> da> > li>
> da> > li> https://reviews.freebsd.org/D16058 is the review.

> da> > I strongly object to this kind of encoding-specific routine.
> da> > Please back it out.  The problem is that top(1) does not support multibyte
> da> > encoding in functions for printing, and using C99 wide/multibyte
> da> > character manipulation API such as iswprint(3) is the way to solve
> da> > it.  Doing getenv("LANG") and assuming an encoding based on it is a
> da> > very bad practice to internationalize software.

> da> I respect what you mean.

> da> Once I back out, I will begin implementing it in a different way.
> da> Please advise which function should be used for implementation
> da> (iswprint (3) and what other functions should be used?)

>  Roughly speaking, POSIX/XPG/C99 I18N model requires the following
>  steps:

>  1. Call setlocale(LC_ALL, "") first.

>  2. Use mbs<->wcs and/or mb<->wc conversion functions in C95/C99 to
>     manipulate characters and strings depending on what you want to
>     do.  The printable() function should use mbtowc(3) and
>     iswprint(3), for example.  And wcslen(3) should be used to
>     determine the length of characters to be printed instead of
>     strlen().

>     Note that if mbs->wcs or mb->wc conversion fails with EILSEQ at
>     some point, some of the character(s) are invalid for printing.
>     This can happen because command-line parameters in top(1) are not
>     always encoded in the encoding specified by LC_CTYPE or LANG.
>     Such characters should also be handled as non-printable.  However,
>     to make matters worse, each process does not necessarily use the
>     same locale as top(1).  A process invoked with LANG=ja_JP.eucJP
>     may have EUC-JP characters in its ARGV array even if top(1) is run
>     by another user whose LANG is en_US.UTF-8.  You have to determine
>     which locale should be used before doing mb->wc conversion.  It is
>     not so simple.

>  3. Print the multibyte characters by using the strvisx(3) family,
>     which supports multibyte characters, or the swprintf(3) family if
>     you want to format wide characters directly.  Note that the buffer
>     length for strvisx(3) must be calculated by using MB_LEN_MAX.

In this case, calling setlocale() and then using strvisx() seems to be
the right solution. If locales differ across processes, this may result
in mojibake, but that cannot really be helped. Even analyzing other
processes' locale variables is not fully reliable: strings may be
incorrectly encoded even in the process's real locale, environment
variables cannot be read across users, and the environment block may be
overwritten by a program.

In general, although converting to wide characters allows a lot of
flexibility, I don't think it is the best approach in all situations:

* The result of mbstowcs() is a wchar_t string (UTF-32 in a UTF-8
  locale), which consumes a lot of memory. A loop with mbrtowc() may
  also be slow. Many operations can be done directly on UTF-8 strings
  with little or no additional complexity compared to byte strings.

* If there is an invalid multibyte character, there is little
  flexibility to handle this usefully and securely, since so little is
  known about the encoding. The best handling may depend on the context.
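
For illustration, here is a minimal sketch of an mbrtowc() loop with one
possible policy for invalid input: replace each undecodable byte with
'?' and resynchronize. The helper name sanitize_mbs() and the '?'
replacement are assumptions made for this example, not code from top(1),
and the caller is assumed to have called setlocale(LC_ALL, "") already.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not from top(1)): copy src to dst, replacing
 * every non-printable character and every undecodable byte with '?'.
 * Decoding follows the current LC_CTYPE setting.
 * Returns the number of bytes written, not counting the final NUL.
 * Output is truncated at a character boundary if dst is too small.
 */
size_t
sanitize_mbs(char *dst, size_t dstsize, const char *src)
{
	mbstate_t st;
	size_t left = strlen(src), out = 0;

	assert(dst != NULL && dstsize > 0);
	memset(&st, 0, sizeof(st));
	while (left > 0) {
		wchar_t wc;
		size_t n = mbrtowc(&wc, src, left, &st);

		if (n == 0)
			break;		/* decoded a NUL: end of string */
		if (n == (size_t)-1 || n == (size_t)-2) {
			/* Invalid or truncated sequence: skip one byte. */
			memset(&st, 0, sizeof(st));
			n = 1;
			wc = L'\0';	/* forces the '?' branch below */
		}
		if (wc != L'\0' && iswprint((wint_t)wc)) {
			if (out + n >= dstsize)
				break;	/* keep room for the NUL */
			memcpy(dst + out, src, n);
			out += n;
		} else {
			if (out + 1 >= dstsize)
				break;
			dst[out++] = '?';
		}
		src += n;
		left -= n;
	}
	dst[out] = '\0';
	return (out);
}
```

Whether '?' is an acceptable replacement, or whether an escape such as
the one produced by strvisx() is needed, depends on the context in which
the string is shown.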

Therefore, in /bin/sh, I have only implemented multibyte support for
UTF-8. All other encodings have bytes treated as characters.
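
As an example of an operation that works directly on UTF-8 bytes,
counting characters only requires skipping continuation bytes. This is
a sketch, not the /bin/sh code; the name utf8_strlen() is hypothetical
and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the characters in a UTF-8 string without
 * converting it to wide characters.  Continuation bytes have the form
 * 10xxxxxx, so every other byte starts a new character.  Invalid
 * UTF-8 is not detected.
 */
size_t
utf8_strlen(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return (n);
}
```

Note that this counts code points, not display columns, so it is not a
substitute for wcwidth() when aligning output.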

However, I do agree that getenv("LANG") is bad. Instead, setlocale()
should be used. After that, nl_langinfo(CODESET) can be called and the
result compared to "UTF-8".
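
A minimal sketch of that check follows. The helper names
codeset_is_utf8() and locale_is_utf8() are made up for this example.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if the codeset name denotes UTF-8. */
bool
codeset_is_utf8(const char *codeset)
{
	assert(codeset != NULL);
	return (strcmp(codeset, "UTF-8") == 0);
}

/*
 * Hypothetical helper: adopt the user's locale (from LC_ALL, LC_CTYPE
 * or LANG) and report whether its character encoding is UTF-8.
 * Intended to be called once at startup.
 */
bool
locale_is_utf8(void)
{
	if (setlocale(LC_ALL, "") == NULL)
		return (false);	/* unsupported locale: "C" stays active */
	return (codeset_is_utf8(nl_langinfo(CODESET)));
}
```

A program would call locale_is_utf8() once early in main() and keep the
result, rather than inspecting getenv("LANG") itself.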

-- 
Jilles Tjoelker


