From: Ad.Irvine@Queens-Belfast.AC.UK
Date: Fri, 23 Dec 94 14:33 GMT
Subject: [www-mling,00139] Unicode and WWW
Message-Id: <199412231434.XAA17481@mail.core.ntt.jp>
Hello, Saluton, Dia Duit, Bonjour.

Hi, I'm new to this group, which I'm very happy to have found. I really dislike the Latin1 straitjacket around the WWW and hope that this group will be able to create and launch a solution framework well before the end of 1995. WWW is growing so fast that we must act quickly enough to keep any inferior de facto standard from winning the race unfairly.

From what I've read so far on this list, it seems that the first question that needs answering is which of the following to choose for WWW servers and browsers:

(a) Unicode only,
(b) Unicode based but allow for other encodings,
(c) multiple encoding methods allowed (one of which is Unicode),
(d) something that doesn't involve Unicode.

We need something with codepoints covering all the writing systems of all the languages in the world, because it seems reasonable to predict that within 100 or 200 years every village will have WWW access. There are currently 5000 +/- 1000 languages (although all those with fewer than 100,000 speakers are generally considered to be at risk), but happily for us the number of scripts is considerably smaller. It is also easy to predict that computing will move from an 8-bit basis to a 32-bit basis, passing through 16-bit-ville on the way.

It seems to me that the Unicode people have seen this future and are planning well in advance. (Isn't it wonderful to see an absence of today-ism?) I note that Windows NT (and to a small extent Windows 95) use Unicode. From the media I (surely like many others) have absorbed the idea and feeling that Unicode is the way of the multilingual future, free of code-page hassles and nightmares. I am therefore confident that we can (and already on this list do) strike off option (d). Browsers and servers must be designed to understand Unicode now, or at least be designed so that Unicode support can easily be added in the future.
To decide between (a), (b) and (c) we need to consider the pros, the cons, and the solutions to the cons, for Unicode as applied to the WWW. The pros are quite obvious (it's the universal character set dream :-) But there are some cons to striving for Unicode in WWW at the present time. I've thought of a few and listed them below (feel free to refine, criticise, praise, correct, modify, add, etc):

(1) "The horse is out of the barn". No it isn't! It's only partly out! Most things in the web at the moment are in ASCII or Latin1. Since the first 256 Unicode codepoints coincide with Latin1, that essentially means that most WWW stuff at present is already written in Unicode. A very important question from the pro-Unicode point of view is how easy it is going to be to convert the non-Unicode texts. Maybe I'm naive or overoptimistic, but I would tend to believe that Unicode translators and editors should be a nonproblem. Could we build here a list of those already in existence? Finally, remember that the non-Unicode WWW texts are very few compared to the Latin1 WWW texts and to the amount of information that will be added (hopefully in Unicode) to the web during the next century.

(2) "The Unicode font set will devour my disk space". Wrong! If I have ISO-8859-1, -2 and -3 fonts installed on my machine, I'll have the "A" to "z" glyphs defined thrice. With the integrated approach of Unicode I'll only have them defined once, so saving me disk space. In addition, you could (using a Unicode tool that should come free with the WWW browser) archive or delete any glyph set that you're unlikely to use often, or ever. For example, _I_ would archive the Hangul stuff; but if ever I came across a WWW page with Hangul in it, the browser should bring up a dialog box saying "Hangul fonts not ready on your machine. [PRESS] to unarchive, [PRESS] to fetch from the net, [PRESS] to use the glyph-not-found glyph, [PRESS] to use ASCII/etc backup chars/strings". (The ASCII backup should be user-definable.)

(3) "Unicode fonts just aren't available".
Again maybe I'm being naive, but I imagine that all we need to get the WWW Unicode show on the road, so to speak, is to create just one Unicode font that contains the more commonly used characters/scripts. This should not be too difficult as the glyphs already exist, albeit scattered across many files and several platforms. So, an important question is whether there already exists a tool to create a Unicode font out of other fonts. It's a matter of compiling and recoding. (PS. How to define "more commonly used"?)

(4) "WWW will be slowed to snail speed by _16_ bit Unicode". Because the network is so damn slow at present, this is a worry that must be addressed. To solve this we could send Unicode 8-bit. (This will also save server disk space.) To achieve this, I think that use should be made of some of the Unicode control characters as language tags. (Are these control characters free to be used in this fashion?) In particular, we could assign the following meanings (assuming the transferred WWW file is 8-bit based):

    00        Encoding scheme "selector"
    01, 02    Encoding "shortcuts"

For example:

    00 00     Latin1 encoding (the default)
    00 01     1-byte Unicode (ie. Latin1 unless codes 01 or 02 are used; see below)
    00 02     2-byte Unicode
    00 04     4-byte ISO 10646
    00 13     ISO-8859-3
    00 zz     Encoding method zz
    01 xx     The one byte xx forms the Unicode character hex_01xx
    02 xx yy  The two bytes form the Unicode character hex_yyxx. (Byte order?)

The browser should have the ability to interpret these control characters, and act as appropriate if possible. The control codes can be used at any point in the document.

--
In conclusion I favour option (b): emphasis on Unicode.
--
Aaron David IRVINE, PhD.
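The selector and shortcut bytes proposed in point (4) of the message above can be sketched as a small decoder. This is only an illustration under stated assumptions, not a definitive reading of the proposal: the function name `decode_tagged` is hypothetical, only the selectors `00 00` (Latin1) and `00 01` (1-byte Unicode) are handled, the shortcuts `01` and `02` are assumed to be active only in 1-byte Unicode mode, and the open "(Byte order?)" question is resolved exactly as written in the table (`02 xx yy` gives hex_yyxx).

```python
def decode_tagged(data: bytes) -> str:
    """Decode an 8-bit stream using the selector/shortcut bytes from point (4).

    Assumptions (not settled by the proposal itself):
      - only selectors 00 00 (Latin1) and 00 01 (1-byte Unicode) are supported;
      - shortcuts 01 and 02 are special only in 1-byte Unicode mode;
      - "02 xx yy" yields code point 0xYYXX, as literally written in the table.
    """
    LATIN1, UNICODE_1BYTE = 0x00, 0x01
    mode = LATIN1                      # "00 00 Latin1 encoding (the default)"
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0x00:
            # Selector: the next byte names the encoding scheme.
            if i + 1 >= len(data):
                raise ValueError("truncated selector at end of input")
            selector = data[i + 1]
            if selector not in (LATIN1, UNICODE_1BYTE):
                raise NotImplementedError(f"selector {selector:#04x} not handled in this sketch")
            mode = selector
            i += 2
        elif mode == UNICODE_1BYTE and b == 0x01:
            # Shortcut "01 xx": the character hex_01xx.
            if i + 1 >= len(data):
                raise ValueError("truncated 01 shortcut")
            out.append(chr(0x0100 | data[i + 1]))
            i += 2
        elif mode == UNICODE_1BYTE and b == 0x02:
            # Shortcut "02 xx yy": the character hex_yyxx.
            if i + 2 >= len(data):
                raise ValueError("truncated 02 shortcut")
            xx, yy = data[i + 1], data[i + 2]
            out.append(chr((yy << 8) | xx))
            i += 3
        else:
            # Any other byte is a Latin1 character, whose code point
            # is numerically identical in Unicode.
            out.append(chr(b))
            i += 1
    return "".join(out)


# "A", then 01 09 (U+0109), then 02 B1 03 (U+03B1), all in 1-byte Unicode mode.
sample = b"\x00\x01" + b"A" + b"\x01\x09" + b"\x02\xb1\x03"
print(decode_tagged(sample))
```

With this reading, plain Latin1 text costs one byte per character, a character in the U+0100 to U+01FF range costs two bytes, and any other 16-bit character costs three, which is the bandwidth trade-off the proposal is aiming at.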
From: "Dan Kegel" <dank@knowledge.adventure.com>
Date: Fri, 23 Dec 1994 01:58:04 -0800 (PST)
Subject: [www-mling,00138] HTML in Unicode? (fwd)
Message-Id: <9412230158.AA03110@knowledge.adventure.com>
Forwarded message:
> Newsgroups: comp.software.international,comp.text.sgml
> From: David_Goldsmith@taligent.com (David Goldsmith)
> Date: Thu, 15 Dec 1994 02:51:13 GMT
>
> In article <3cm88r$t88@gap.cco.caltech.edu>, dank@alumni.caltech.edu
> (Daniel R. Kegel) wrote:
> > There is a multilingual WWW mailing list you might be interested in,
> > www-mling@square.ntt.jp. Also check out the homepage of its author,
> > <a href="http://www.ntt.jp/people/takada/takada.html">takada</a>
>
> There is also a mailing list for discussing use of Unicode with WWW. Send
> subscription requests to www-request@unicode.org.

Even if your first reaction is "But Unicode isn't the answer!", no need to worry. I bet the folks who support Unicode for HTML will also be happy to support ISO2022 and existing national character sets (right, guys?). They love the fact that Unicode makes a great intermediate representation for translating between character sets, and it's reasonable to have a place to discuss this & other aspects of using Unicode on the Web. In fact, if you don't watch out, you may find yourself considering using Unicode internally as part of your character set translation process :-)

- Dan
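The idea of Unicode as an intermediate representation for character set translation can be sketched in a few lines. This is a minimal illustration, not anything taken from the message: the helper name `transcode` is hypothetical, and it relies on Python's built-in codecs, where a decoded `str` is a sequence of Unicode code points. Each character set then needs only one table to and from Unicode, rather than one table per pair of character sets.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one character set to another via Unicode.

    `source` and `target` are Python codec names (for example
    "iso2022_jp", "euc_jp", "latin-1").
    """
    text = data.decode(source)   # source character set -> Unicode code points
    return text.encode(target)   # Unicode code points -> target character set


# Round trip: ISO-2022-JP bytes become EUC-JP bytes without a direct
# ISO-2022-JP-to-EUC-JP table, because both map to and from Unicode.
original = "日本語"
iso2022_bytes = original.encode("iso2022_jp")
euc_bytes = transcode(iso2022_bytes, "iso2022_jp", "euc_jp")
print(euc_bytes.decode("euc_jp") == original)
```

A character that the target set cannot represent raises `UnicodeEncodeError` here; a real converter would need a policy for such cases, which this sketch leaves out.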