From: Ad.Irvine@Queens-Belfast.AC.UK
Date: Fri, 23 Dec 94 14:33 GMT
Subject: [www-mling,00139] Unicode and WWW
Message-Id: <199412231434.XAA17481@mail.core.ntt.jp>


Hello, Saluton, Dia Duit, Bonjour.

Hi, I'm new to this group, which I'm very happy to have found.
I really dislike the Latin1 straitjacket around the WWW and hope
that this group will be able to create and launch a solution
framework well before the end of 1995.  The WWW is growing so fast
that we must act quickly enough to stop any inferior de facto
standard from winning the race unfairly.

From what I've read so far on this list, it seems that the first
question that needs answering is whether to choose (a), (b), (c),
or (d) for WWW servers and browsers:
(a) Unicode only,
(b) Unicode based but allow for other encodings,
(c) multiple encoding methods allowed (one of which is Unicode),
(d) something that doesn't involve Unicode.

We need something with code points covering all the writing systems
of all the languages in the world, because it seems reasonable to
predict that within 100 or 200 years every village will have WWW
access.  There are currently some 5000 (plus or minus 1000) languages
(although all those with fewer than 100,000 speakers are generally
considered to be at risk), but happily for us the number of scripts is
considerably smaller.  It is also easy to predict that computing will
leave the 8-bit world for a 32-bit one, passing through 16-bit-ville
on the way.

It seems to me that the Unicode people have seen this future and are
planning well in advance.  (Isn't it wonderful to see an absence of
today-ism?)  I note that Windows NT (and to a small extent Windows 95)
uses Unicode.  From the media I (surely like many others) have absorbed
the idea that Unicode is the way to a multilingual future free of
code-page hassles and nightmares.  I am therefore confident that we can
(and on this list already do) strike off option (d).  Browsers and
servers must be designed to understand Unicode now, or at least be
designed so that Unicode support can easily be added later.

To decide between (a), (b), and (c) we need to consider the pros, the
cons, and the solutions to those cons, for Unicode as applied to the
WWW.  The pros are quite obvious (it's the universal character set
dream :-)  But there are some cons to striving for Unicode on the WWW
at the present time.  I've thought of a few and listed them below
(feel free to refine, criticise, praise, correct, modify, add, etc.):

(1) "The horse is out of the barn".  No it isn't!  It's only partly
out!  Most things in the web at the moment are in ASCII or Latin1 -
that essentially means that most WWW stuff at the present is written
in Unicode.  A very important question from the pro-Unicode point of 
view is how easy is it going to be to convert the non-Unicode texts.
Maybe I'm naive or overoptimistic, but I would tend to believe that
Unicode translators and editors should be a nonproblem.  Could we
build here a list of such already in existence.  Finally remember,
the non-Unicode WWW texts are very few compared to the Latin1 WWW texts 
and the amount of information that will be added (hopefully in Unicode)
to the web during the next century.
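
As a minimal sketch (in Python, purely for illustration) of why the
Latin1 case is a non-problem: each Latin1 byte value is numerically
identical to the corresponding Unicode code point, so "conversion" is
nothing more than widening each byte.

    # Sketch: Latin1 (ISO-8859-1) bytes to Unicode code points.
    # Each byte value 0x00-0xFF IS the Unicode code point with the
    # same number, so the conversion is a straight widening.

    def latin1_to_unicode(data):
        """Return the list of Unicode code points for Latin1 bytes."""
        return [byte for byte in data]          # code point == byte value

    # Example: "Belfast" followed by e-acute (Latin1 0xE9 == U+00E9).
    sample = bytes([0x42, 0x65, 0x6C, 0x66, 0x61, 0x73, 0x74, 0xE9])
    print(latin1_to_unicode(sample))            # ends with 233 == 0xE9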

(2) "The Unicode font set will devour my disk space".  Wrong!  If I
have ISO-8859-1 -2 and -3 fonts installed on my machine, I'll have "A"
to "z" glyphs defined thrice.  With the integrated approach of Unicode
I'll only have them defined once, so saving me disk space.  In addition,
one could (using a Unicode tool that should come free with the WWW 
browser) archive or delete any glyph set that you're unlikey to ever or
frequently use.  For example, _I_ would archive the Hangul stuff; but if
ever I came across a WWW page with Hangul in it, the brower should bring
up a dialog box saying "Hangual fonts not ready on your machine. [PRESS]
to unarchive, [PRESS] to fetch from the net, [PRESS] to use the glyph-
not-found glyph, [PRESS] to use ASCII/etc backup chars/strings".  (The 
ASCII backup should be user-definable.)
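
Purely as an illustrative sketch (Python; every name here, including
'store' and 'ask_user', is hypothetical and not any real browser's
API), the fallback above might look roughly like this:

    # Hypothetical sketch of the dialog-box fallback described above.

    def glyph_for(codepoint, store, ask_user):
        """Find a glyph for codepoint, asking the user how to fall back."""
        glyph = store.lookup(codepoint)
        if glyph is not None:
            return glyph
        choice = ask_user("Fonts for this script not ready on your machine.")
        if choice == "unarchive":
            store.unarchive(codepoint)            # unpack the archived glyph set
            return store.lookup(codepoint)
        if choice == "fetch":
            store.fetch_from_net(codepoint)       # pull the glyph set off the net
            return store.lookup(codepoint)
        if choice == "ascii":
            return store.ascii_backup(codepoint)  # user-definable ASCII backup
        return store.glyph_not_found()            # the glyph-not-found glyph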

(3) "Unicode fonts just aren't available".  Again maybe I'm being naive,
but I imagine that all we need to get the WWW Unicode show on the road
so to speak, is to create just one Unicode font that contains the more 
commonly used characters/scripts.  This should not be too difficult as
the glyphs already exits, albeit scattered across many files and several
platforms.  So, an important question is whether there already exists
a tool to create a Unicode font out of other fonts.  It's a matter of
compiling and recoding.  (PS.  How to define "more commonly used"?)
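
To make the "compiling and recoding" concrete, here is a rough sketch
(Python; 'load_font' and the mapping tables are assumptions, not any
real tool's interface) of merging several 8-bit fonts into one
Unicode-indexed glyph table:

    # Hypothetical sketch: compile one Unicode-indexed glyph table from
    # several existing 8-bit fonts.  Each source font needs a table from
    # its own 8-bit codes to Unicode code points; only the Latin1 table
    # is trivial (identity).  'load_font' is an assumed helper returning
    # {8-bit code: glyph} for a given font file.

    LATIN1_MAP = {code: code for code in range(0x20, 0x100)}

    def build_unicode_font(sources, load_font):
        """sources: list of (font file name, {8-bit code: code point})."""
        unicode_glyphs = {}
        for filename, to_unicode in sources:
            font = load_font(filename)
            for code8, codepoint in to_unicode.items():
                # Keep the first glyph seen for each code point, so the
                # "A".."z" shapes shared by the 8859 family are stored once.
                if code8 in font and codepoint not in unicode_glyphs:
                    unicode_glyphs[codepoint] = font[code8]
        return unicode_glyphs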

(4) "WWW will be slowed to snail speed by _16_ bit Unicode".  Because
the network is so damn slow at present, this is a worry that must be 
addressed.  To solve this we could send Unicode 8-bit.  (This also
will save server disk space.)

To achieve this, I think use should be made of some of the Unicode
control characters as encoding tags.  (Are these control characters
free to be used in this fashion?)  In particular, we could assign the
following meanings (assuming the transferred WWW file is 8-bit based):
00 is the encoding-scheme "selector", and 01 and 02 are encoding
"shortcuts".  For example:

00 00 Latin1 encoding (the default)
00 01 1-byte Unicode (i.e. Latin1 unless codes 01 or 02 are used; see below)
00 02 2-byte Unicode
00 04 4-byte ISO 10646
00 13 ISO-8859-3
00 zz Encoding method zz
01 xx The one byte xx forms the Unicode character hex_01xx
02 xx yy The two bytes form the Unicode character hex_yyxx. (Byte order?)
The browser should have the ability to interpret these control
characters and act appropriately where possible.  The control codes can
be used at any point in the document.
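
As a minimal decoding sketch of the proposal (Python, for illustration
only): it assumes the stream has already selected "1-byte Unicode" with
00 01, and it picks low-byte-first for the 02 shortcut, which is exactly
the byte-order question left open above.

    # Sketch of decoding the 00/01/02 scheme proposed above, restricted
    # to the "1-byte Unicode" method.  The byte order for the 02
    # shortcut (low byte first) is an assumption, not a decision.

    def decode_1byte_unicode(data):
        """Turn an 8-bit stream using the 00/01/02 codes into code points."""
        points = []
        i = 0
        while i < len(data):
            b = data[i]
            if b == 0x00:                   # 00 zz: switch encoding method
                method = data[i + 1]
                i += 2
                if method != 0x01:          # only 1-byte Unicode handled here
                    raise NotImplementedError("method %02x" % method)
            elif b == 0x01:                 # 01 xx -> U+01xx
                points.append(0x0100 | data[i + 1])
                i += 2
            elif b == 0x02:                 # 02 xx yy -> U+yyxx (order assumed)
                points.append(data[i + 1] | (data[i + 2] << 8))
                i += 3
            else:                           # plain byte == Latin1 == U+00..U+FF
                points.append(b)
                i += 1
        return points

    # An s-caron (U+0161) followed by "u" would travel as 01 61 75:
    print(decode_1byte_unicode(bytes([0x00, 0x01, 0x01, 0x61, 0x75])))  # [353, 117]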


--

In conclusion, I favour option (b): emphasis on Unicode.

--

Aaron David IRVINE, PhD.


From: "Dan Kegel" <dank@knowledge.adventure.com>
Date: Fri, 23 Dec 1994 01:58:04 -0800 (PST)
Subject: [www-mling,00138] HTML in Unicode? (fwd)
Message-Id: <9412230158.AA03110@knowledge.adventure.com>


Forwarded message:
> Newsgroups: comp.software.international,comp.text.sgml
> From: David_Goldsmith@taligent.com (David Goldsmith)
> Date: Thu, 15 Dec 1994 02:51:13 GMT
> 
> In article <3cm88r$t88@gap.cco.caltech.edu>, dank@alumni.caltech.edu
> (Daniel R. Kegel) wrote:
> > There is a multilingual WWW mailing list you might be interested in,
> > www-mling@square.ntt.jp.  Also check out the homepage of its author,
> > <a href="http://www.ntt.jp/people/takada/takada.html">takada</a>
> 
> There is also a mailing list for discussing use of Unicode with WWW. Send
> subscription requests to www-request@unicode.org.

Even if your first reaction is "But Unicode isn't the answer!", no need to 
worry.  I bet the folks who support Unicode for HTML will also be happy 
to support ISO2022 and existing national character sets (right, guys?).
They love the fact that Unicode makes a great intermediate representation
for translating between character sets, and it's reasonable to have a
place to discuss this & other aspects of using Unicode on the Web.
In fact, if you don't watch out, you may find yourself considering using
Unicode internally as part of your character set translation process :-)
- Dan