Re: character sets in HTTP: translation tables
Gavin Nicol (gtn@ebt.com)
Thu, 29 Dec 1994 08:19:44 -0500
Larry Masinter writes:
>Unicode is one of the answers. It is perfectly possible to do
>multi-lingual HTML without Unicode, for those who would prefer to do
>so.
I have never said that Unicode is "the only answer", and in fact, I
specifically mention ways of *not* using it in my paper. Indeed,
there is absolutely no reason why what you outline, and what I
describe in my paper, cannot coexist.
>Unicode may be the answer for you and for me, but it's become
>abundantly clear that it isn't the answer for some folks, and we don't
>actually need to force the issue this way.
I have heard many objections to Unicode, and upon discussion, I have
found that most are related to display, or unsupported language
feature issues. If we have "some folks" out there, please raise your
objections now: I would be very interested in hearing them, and even
more interested to hear if some kind of "hint" is not sufficient to
handle the objection.
>I've found the capabilities alluded to in <URL:http://www.nada.kth.se/
Yes, this is interesting stuff, and can be used for part of what I
talked about.
>Receivers might be expected to either (1) understand the character
>sets the sender is able to present or (2) call some proxy that is able
>to translate the document into a character set the client knows.
.....
[Some detail deleted]
>Current practice is that servers offer documents in the character sets
>they have, and clients either display them correctly or attempt to
>translate them. If you browse the net looking for web sites by
>country, you'll come up with lots of examples.
....
[Some examples of charset specific browsers and servers deleted]
>In general, users with large collections of documents in national
>character sets will just make them available in those forms.
Because there is nothing else to make them available in?
>They're currently expecting the reader, if they're really interested,
>to obtain support for the character encoding used in the documents on
>those servers. A regime where the documents are properly labelled, and
>the client fetches the translation table the first time in order to
>properly display the documents:
>
> a) will work well
This seems a rather premature claim.
> b) is efficient
Hmm. When I think of 200,000 accesses from the US to a Japanese site
from a browser that does not support SJIS, EUC, Arabic, and Thai,
all of which are used within a single document (lord only knows
how!), resulting in 200,000 requests for each of those translation
tables, and compare this to a single server "wasting" cycles doing a
conversion to UTF-8 (which would probably be cached after the first
access), I have trouble imagining this to be more efficient.
Requiring the browsers to call a proxy is at least as inefficient as
server conversion (actually more so due to redirect), and relies
upon the same (or probably, more complicated) caching techniques for
performance.
Worse, doing in-client translation will almost certainly require
16 bits *somewhere* within the client... and what is everything
being translated *into*? Unless you are translating into some
virtual character set (which would need to be at least as large as
the largest national character set: read 16 bits), or into Unicode,
or transliterating, you will also need to reprogram the parsers for
the different data formats you purport to support, so that they will
be able to recognise the characters as belonging to a particular
class of characters to be used in parsing (think of SGML parsing, or
even HTML parsing.) In addition, your proposal generally assumes a
single language, and a single character set per document, mine
assumes (in the "core" case), a single character set and multiple
languages (though your proposal can indeed emulate mine via proxy).
My proposal requires people to update clients once (to achieve basic
functionality). Your proposal requires alls client to also be changed
at least once to add support for translation tables (unless they be
forever stuck with transliteration, and the overhead thereof). My
proposal requires changes to servers, so does your proposal.
c) has a reasonable transition from current practice.
True. So does my proposal. In the worste case, current practise (or
something close to it) is supported.
>The servers don't have the computational resources to translate the
>documents on-the-fly to Unicode, nor is such translation particularly
>efficient in terms of network bandwidth.
I don't buy this, especially if the server uses good caching
techniques.
The biggest problem I have with this idea is that in the worste case
(and indeed, I would argue in many cases), the client and server
will not be able to communicate without help. Additionally, it
encourages the creation of browsers with widely disparate
capabilities, and version numbers (read: bug lists), and breaking the
web into "villages" with their own culture, own language, own GUI,
etc.
We need a common base, a lingua franca, so that in the worste case
(no external help available), we can communicate. Unicode is the
only thing I know of that offers a reasonable many->one->many
path. Additionally, it promises to offer a *single* processing model
for character data of *mixed* languages (at least until it needs to
be displayed), and most (all?) parsers could use Unicode a base for
character classification, and hence, the parsing tables would not
need to be reprogrammed. Eventually, free tools, and libraries
*will* appear to aide in processing, and display.
> accept-parameter: charset=unicode-1-1-utf7
OK. This looks fine to me. (Perhaps utf8 or ucs-2??)
>The MIME draft standard makes no such claims. There is a document
>being circulated by the mail extensions working group which makes
Thank you for clarifying this.
BTW. I tend to beleive that unless a program was specifically designed
to manipulate charsets and/or encodings, that the actual processing
within it can be abstracted away from concrete codes. If a
program/system relies upon concrete codes, it is BAD (Broken As
Designed) in my books, though certainly, using Unicode codes gives a
greater range of flexibility than other code sets.