From: jjc@jclark.com (James Clark)
Date: Thu, 15 Dec 1994 11:31:29 +0000
Subject: [www-mling,00095] Re: SJIS & HTML
Message-Id: <9412151131.AA05448@jclark.com>
> Date: Wed, 14 Dec 1994 19:50:48 -0800
> From: Bob Jung <bobj@mcom.com>
>
> My question was NOT how ***should*** the browsers handle Japanese.
> Today, there exist SJIS HTML pages on the web (some are generated
> on-the-fly by delegate servers), and I believe most browsers today
> still parse byte-by-byte, NOT char-by-char.
>
> So to rephrase my question:
> Isn't there a conflict with the HTML metachars and won't the
> 2nd byte of some SJIS characters be incorrectly interpreted
> as '<' or '>'? If so, how are these conflicts being handled?
>
> I've viewed the source of a couple of these SJIS pages and haven't
> seen any special handling. It looks like straight SJIS with HTML
> tags. But then, probably the SJIS characters in these files did not
> have any conflicting 2nd bytes.
>
> >James Clark's new free SGML parser "sp" is a good implementation that
> >uses a neutral internal representation. I heard that it scans EUC or
> >SJIS and translates it into Unicode as an internal representation.
>
> Is "sp" SJIS aware?
Yes.
> How does it avoid the above problem unless the
> content developer assigns (as Sato-san writes):
> >the appropriate bit
> >combinations to both the Latin alphabets and the Japanese characters to
> >distinguish them.
The content developer just has to tell the entity manager that the
file uses SJIS.
The idea is that the files in which your document is stored and the
entities that the SGML parser parses may use different encodings.
Entities can only use an encoding that can be described by an SGML
declaration: each character is represented by a single bit
combination, which may be of arbitrary width, and the encoding must
be stateless. Files, on the other hand, can use completely arbitrary
coding systems: each character may be represented by a variable number
of bytes, in a possibly stateful way. The entity manager therefore
needs the capability to translate from the file coding system to the
entity coding system. So when you identify an entity to sp's entity
manager, in addition to specifying the file (or, more generally, the
object in which the entity is stored), you can also specify an
encoding translation.
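To make the division of labour concrete, here is a minimal sketch in C
(hypothetical names and types; SP itself is written in C++ and its real
interface differs): an entity pairs a storage object with an encoding
translation, and the parser pulls characters only through the
translator, never touching raw bytes.

    #include <stdio.h>

    /* Minimal sketch, not SP's actual interface: an entity couples a
     * storage object (here simply a stdio stream) with a decoder that
     * translates the file's coding system into fixed-width character
     * codes, which are all the parser ever sees. */
    typedef unsigned short Char;             /* one parsed character code */

    typedef struct {
        FILE *file;                          /* the storage object */
        int (*decode)(FILE *, Char *);       /* the encoding translation */
    } Entity;

    /* The parser reads through this; which coding system the file uses
     * is entirely the decoder's business.  Returns 1 on success, 0 at
     * end of entity. */
    static int entity_get(Entity *e, Char *c)
    {
        return e->decode(e->file, c);
    }

    /* A trivial decoder for plain 7-bit files, for illustration. */
    static int decode_ascii(FILE *f, Char *c)
    {
        int b = getc(f);
        if (b == EOF)
            return 0;
        *c = (Char)b;
        return 1;
    }

    int main(void)
    {
        Entity e = { stdin, decode_ascii };  /* an SJIS file would simply
                                                plug in a different decoder */
        Char c;
        while (entity_get(&e, &c))
            printf("%04X ", c);
        return 0;
    }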
So to handle SJIS, you need to specify a translation that translates
from SJIS. Currently SP provides only one translator that translates
from SJIS: it translates to the standard fixed-width 16-bit EUC
encoding. In this encoding every character is represented by a 16-bit
code: the code of characters in the G0 set (usually the Japanese
version of ISO 646) is unchanged; the code of characters in the G1 set
(usually JIS X 0208-1990) is ORed with 0x8080; the code of characters
in the G2 set (usually half-width katakana from JIS X 0201-1986) is
ORed with 0x0080; the code of characters in the G3 set (JIS X
0212-1990) is ORed with 0x8000. (I'm told this is standard practice
in the Japanese SGML community.) You could equally well have a
translator that translates to Unicode.
SP also provides a translator that translates from EUC to the same
fixed-width 16-bit encoding. The important point is that the *parser*
sees exactly the same sequence of codes whether the file uses SJIS or
EUC. Of course, with this approach, you also completely avoid
problems of false delimiter recognition.
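As a concrete sketch of the two translations (illustration only, not
SP's code, and assuming the usual published SJIS and EUC byte layouts),
the following C program decodes one character from each kind of byte
stream into the fixed-width 16-bit encoding just described. Hiragana
"a" comes out as 0xA4A2 whether it arrives as the SJIS bytes 0x82 0xA0
or the EUC bytes 0xA4 0xA2:

    #include <stdio.h>

    /* Decode one character into the fixed-width 16-bit encoding.
     * Returns the number of bytes consumed, or 0 on an invalid byte. */

    static int from_sjis(const unsigned char *p, unsigned *code)
    {
        unsigned c = p[0], c2, ku, ten;
        if (c < 0x80) { *code = c; return 1; }       /* G0: unchanged */
        if (c >= 0xA1 && c <= 0xDF) {                /* G2: half-width kana */
            *code = (c & 0x7F) | 0x0080;             /* JIS X 0201 | 0x0080 */
            return 1;
        }
        if ((c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xEF)) {
            c2 = p[1];                               /* G1: JIS X 0208 */
            if (c2 < 0x40 || c2 == 0x7F || c2 > 0xFC) return 0;
            if (c >= 0xE0) c -= 0x40;                /* fold the lead ranges */
            ku = (c - 0x81) * 2 + 1;
            if (c2 >= 0x9F) { ku++; ten = c2 - 0x9F + 1; }
            else ten = (c2 > 0x7E ? c2 - 1 : c2) - 0x40 + 1;
            *code = (((ku + 0x20) << 8) | (ten + 0x20)) | 0x8080;
            return 2;
        }
        return 0;
    }

    static int from_euc(const unsigned char *p, unsigned *code)
    {
        unsigned c = p[0];
        if (c < 0x80) { *code = c; return 1; }       /* G0: unchanged */
        if (c == 0x8E) { *code = p[1]; return 2; }   /* SS2, G2: | 0x0080 */
        if (c == 0x8F) {                             /* SS3, G3: | 0x8000 */
            *code = (((p[1] & 0x7Fu) << 8) | (p[2] & 0x7Fu)) | 0x8000;
            return 3;
        }
        if (c >= 0xA1 && c <= 0xFE) {                /* G1: | 0x8080 */
            *code = (c << 8) | p[1];
            return 2;
        }
        return 0;
    }

    int main(void)
    {
        const unsigned char sjis[] = { 0x82, 0xA0 }; /* hiragana "a", SJIS */
        const unsigned char euc[]  = { 0xA4, 0xA2 }; /* same char, EUC */
        unsigned a = 0, b = 0;
        from_sjis(sjis, &a);
        from_euc(euc, &b);
        printf("SJIS -> %04X, EUC -> %04X\n", a, b); /* both print A4A2 */
        return 0;
    }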
James Clark
jjc@jclark.com
From: Ken Itakura U3/ISE-Japan 8-694-6422 DECpark 4F 15-Dec-1994 1509 <itakura@jrdv04.enet.dec-j.co.jp>
Date: Thu, 15 Dec 94 15:08:59 +0900
Subject: [www-mling,00094] RE: Netscape & Unicode (was: Beware of the bureaucrats: the future of Multilingual WWW?)
Message-Id: <9412150608.AA12827@jrdmax.jrd.dec.com>
Hi,
I support Bob and Dan.
We cannot ignore the 'de facto' standards, which in Japan are JIS, Shift JIS,
and EUC, and we cannot ignore Unicode either. I don't think the situation
will change in the future: even if Unicode gains popularity, we cannot
ignore the 'de facto' standards.
Unicode might be an answer to I18N, and we could choose it as the single
codeset for the world of the WWW. But unfortunately it is too late for that,
since we already have many pages written in non-Unicode codesets. Here again,
we have to find a way to live with both Unicode and the 'de facto' codesets.
I think the answer to this situation is for HTML to have a tag that
specifies the codeset, e.g. <charset="ISO2022-JP"> ... </charset>.
Unicode could be one of the possible values. Then the creators of each
browser can decide the priority of the codesets they support according
to market requirements.
The migration path will be...
1. implement support for the de facto standard codesets
2. implement support for Unicode
3. implement parsing of the charset tags
Even after the charset tag is defined, we will still need a UI for
specifying the preferred codeset for old documents.
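As a rough sketch of step 3 in C (hypothetical function and charset
names; the tag itself is only a proposal, so nothing here is
standardized), the tag's value would simply select a decoder, and the
absence of a tag would fall back to the UI preference:

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical sketch of step 3 of the migration path, not any
     * browser's real code: map the value of a <charset="..."> tag to a
     * codeset, falling back to the user's preferred codeset (set in
     * the UI) for old documents that carry no tag. */
    enum codeset { JIS, SJIS, EUC, UNICODE };

    static enum codeset user_preference = SJIS;  /* from the browser's UI */

    static enum codeset codeset_for(const char *charset)
    {
        if (charset == NULL)                return user_preference;
        if (!strcmp(charset, "ISO2022-JP")) return JIS;
        if (!strcmp(charset, "Shift_JIS"))  return SJIS;
        if (!strcmp(charset, "EUC-JP"))     return EUC;
        if (!strcmp(charset, "UNICODE"))    return UNICODE;
        return user_preference;             /* unknown value: fall back */
    }

    int main(void)
    {
        /* A tagged document selects JIS; an untagged one gets the UI
         * choice, which is exactly the old-document fallback above. */
        printf("%d %d\n", codeset_for("ISO2022-JP"), codeset_for(NULL));
        return 0;
    }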
Comment?
Ken Itakura
From: Ken Itakura U3/ISE-Japan 8-694-6422 DECpark 4F 15-Dec-1994 1527 <itakura@jrdv04.enet.dec-j.co.jp>
Date: Thu, 15 Dec 94 15:37:43 +0900
Subject: [www-mling,00093] RE: SJIS & HTML
Message-Id: <9412150637.AA13644@jrdmax.jrd.dec.com>
>So to rephrase my question:
> Isn't there a conflict with the HTML metachars and won't the
> 2nd byte of some SJIS characters be incorrectly interpreted
> as '<' or '>'? If so, how are these conflicts being handled?
I don't think SJIS has any character that conflicts with '<' or '>'.
The second byte of SJIS doesn't use all of ASCII, only part of it, and
that part does not include '<' or '>'.
If you are talking about JIS, JIS does have such conflicts. But JIS is a
stateful encoding, so you can resolve the conflicts by handling the state
correctly.
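A quick check of this claim (byte ranges as commonly published for
SJIS; a sketch, not code from any browser):

    #include <stdio.h>

    /* The second byte of a double-byte SJIS character is always in
     * 0x40-0x7E or 0x80-0xFC; '<' (0x3C) and '>' (0x3E) lie below 0x40
     * and so can never appear as a second byte. */
    static int is_sjis_second_byte(unsigned char b)
    {
        return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
    }

    int main(void)
    {
        printf("'<' as 2nd byte? %s\n", is_sjis_second_byte('<') ? "yes" : "no");
        printf("'>' as 2nd byte? %s\n", is_sjis_second_byte('>') ? "yes" : "no");
        /* Both print "no".  Note that other ASCII, e.g. '\\' (0x5C) and
         * the letters, CAN occur as second bytes, so SJIS is still not
         * fully transparent to naive ASCII processing. */
        return 0;
    }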
Ken
From: CHEN YiLong <cyl@ifcss.org>
Date: Wed, 14 Dec 1994 23:36:10 -0400 (EST)
Subject: [www-mling,00092] cnd.org's solution for HZ (Was: SJIS & HTML)
Message-Id: <Pine.PCW.3.91.941214232837.9399A-100000@PPP-72-6.BU.EDU>
On Thu, 15 Dec 1994 dank@knowledge.adventure.com wrote:
>
> Um... what would be the proper escaping? Would you use &lt; in the middle
> of an SJIS character to represent the 2nd byte? Doesn't sound right.
> - Dan
What you described is the approach that cnd.org takes in its
Chinese WWW server's HTML pages when using HZ code; otherwise the
Chinese characters won't show up correctly.
Nelson Chin
--
butta1@bu.edu/cyl@ifcss.org
From: dank@knowledge.adventure.com
Date: Wed, 14 Dec 94 20:51 PST
Subject: [www-mling,00091] Re: Netscape & Unicode (was: Beware of the bureaucrats: the future of Multilingual WWW?)
Message-Id: <9412142051.AA02732@knowledge.adventure.com>
Hi Bob!
>Somehow Dan got the wrong impression about the direction we at Netscape
>are pursuing. Supporting Unicode is definitely a direction for our
>product.
>However, today there is not much data on the web in Unicode, so we
>are not planning to ***hastily rush*** any solution to the market.
>On the other hand, we are being careful that the products we release
>NOW will not prohibit us from supporting Unicode and other future
>multilingual solutions.
>We are very interested in working with the rest of the I18N community
>to come up with the "correct" extensions to the standards.
I can see that your heart is in the right place (and your experience at Apple
can only help; Apple is IMHO one of the best at writing international
software).
But I think I got the right impression about your direction. It seems to me
that at least partial Unicode compatibility should be in the first release of
your multilingual browser because
1. It is easy to implement; all you need is a translation table from Unicode
to the local character set. The table will be about 64K in size, and doesn't
need to be loaded into memory unless it is used (see the sketch after this
list). I believe Windows 95 will include this table in the operating system.
2. The impact of Netscape's browsers is incredible. If your first browser doesn't
support Unicode, nobody will bother to put any Unicode data on the Web.
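A sketch of what point 1 might look like (made-up names, assuming a
single-byte local character set; illustration only):

    #include <stdio.h>
    #include <stdlib.h>

    /* A flat table mapping each 16-bit Unicode value to a local byte:
     * 65536 one-byte entries, i.e. about 64K, loaded lazily so it
     * costs nothing until Unicode data actually appears. */
    static unsigned char *uni_to_local;

    static void load_table(void)
    {
        unsigned i;
        uni_to_local = malloc(65536);
        if (uni_to_local == NULL)
            exit(1);
        /* A real browser would read a vendor-supplied table from disk;
         * as a placeholder, pass Latin-1 through and map the rest to '?'. */
        for (i = 0; i < 65536; i++)
            uni_to_local[i] = i < 0x100 ? (unsigned char)i : '?';
    }

    static unsigned char to_local(unsigned short u)
    {
        if (uni_to_local == NULL)
            load_table();            /* pay the 64K only when needed */
        return uni_to_local[u];
    }

    int main(void)
    {
        printf("%c %c\n", to_local(0x0041), to_local(0x3042));  /* "A ?" */
        return 0;
    }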
One needn't implement *all*, or even much, of Unicode for it to be useful.
Supporting just the part that corresponds to the local character
set seems like a sensible and easy way to start, as it would be a critical first
step towards usability of Unicode on the net. Who would put Unicode content
up if there were no browsers? Nobody. Who would write a commercial browser
that supported Unicode fully if there were no Unicode content on the net? Nobody.
The way to break the deadlock is to take a very small incremental step towards
Unicode support in browsers, while we Web content providers take small steps towards
providing Unicode content on the Web.
My two bits, anyway.
- Dan
From: dank@knowledge.adventure.com
Date: Wed, 14 Dec 94 20:49 PST
Subject: [www-mling,00090] Re: SJIS & HTML
Message-Id: <9412142050.AA02714@knowledge.adventure.com>
Bob asks:
> The delegate server ***could*** be adding the proper escaping of the
> conflicting bytes (as mentioned above) when it converts the text
> from JIS to SJIS... And maybe no one is hand-typing HTML in SJIS?
Um... what would be the proper escaping? Would you use &lt; in the middle
of an SJIS character to represent the 2nd byte? Doesn't sound right.
- Dan
From: "Nelson Chin <butta1@ifcss.org>" <butta1@ifcss.org>
Date: Wed, 14 Dec 1994 22:51:53 -0500 (EST)
Subject: [www-mling,00089] Netscape and Unicode
Message-Id: <Pine.A32.3.91.941214222956.163869A-100000@acs3.bu.edu>
You may have heard about the Unicode xterm demo release, which can be
used as a viewer in the X Window System environment. It can also support
Chinese codes, but I'm waiting for the uxterm author to come back from
holiday break to add Japanese/Korean national code support, since the
revised Unicode font now has JIS/KSC round-trip capability.
The Unicode font is temporarily at
ftp://crsa.bu.edu/incoming/unihan
The unihan portion still has gaps in it; volunteers are welcome to fill them in.
For more info on the font format, see
ftp://cnd.org/pub/software/info/HBF.html
If you're spawning a uxterm session not using the default code (UTF-8):
At the command line, type:
--------------------------
uxterm -code code_index
In UTerm.ad, add the line:
--------------------------
*codeIndex: code_index
where code_index is one of:
UTF_8   12
UTF_7   13
GB       1
HZ      35
HZX     36
BIG5     2
(info courtesy of Chong Chiah Jen <chiahjen@iss.nus.sg>)
For more info on Unicode, check out this URL:
http://www.stonehand.com/unicode.standard.html
From: bobj@mcom.com (Bob Jung)
Date: Wed, 14 Dec 1994 19:50:48 -0800
Subject: [www-mling,00088] Re: SJIS & HTML
Message-Id: <199412150350.TAA02082@neon.mcom.com>
My question was NOT how ***should*** the browsers handle Japanese.
Today, there exist SJIS HTML pages on the web (some are generated
on-the-fly by delegate servers), and I believe most browsers today
still parse byte-by-byte, NOT char-by-char.
So to rephrase my question:
Isn't there a conflict with the HTML metachars and won't the
2nd byte of some SJIS characters be incorrectly interpreted
as '<' or '>'? If so, how are these conflicts being handled?
I've viewed the source of a couple of these SJIS pages and haven't
seen any special handling. It looks like straight SJIS with HTML
tags. But then, probably the SJIS characters in these files did not
have any conflicting 2nd bytes.
>James Clark's new free SGML parser "sp" is a good implementation that
>uses a neutral internal representation. I heard that it scans EUC or
>SJIS and translates it into Unicode as an internal representation.
Is "sp" SJIS aware? How does it avoid the above problem unless the
content developer assigns (as Sato-san writes):
>the appropriate bit
>combinations to both the Latin alphabets and the Japanese characters to
>distinguish them.
The delegate server ***could*** be adding the proper escaping of the
conflicting bytes (as mentioned above) when it converts the text
from JIS to SJIS... And maybe no one is hand-typing HTML in SJIS?
Can someone please confirm or contradict my understanding?
Thanks,
Bob
>Bob Jung writes:
> > Don't most of the parsers scan byte-by-byte for the meta chars?
> > So, isn't this a problem?
>
>dank@knowledge.adventure.com writes:
> > I expect any good multi-lingual web client will scan char-by-char rather
> > than byte-by-byte.
> > - Dan
>
>Right.
>
>In terms of SGML, an SGML parser includes an entity manager as a part
>of it. The entity manager has responsibility for resolving the external
>representation: it finds a file, scans it, then interprets each
>character in the file (is it a Latin letter? is it a Japanese
>character?) and passes the SGML character to the parse module.
>An SGML character is a (fixed-length) bit combination that represents
>the "character", and it has a unique character number which is
>usually denoted by decimal digits.
>
>If you want to treat Japanese characters in strict conformance to the
>SGML standard (ISO 8879), you should assign appropriate bit
>combinations to both the Latin alphabets and the Japanese characters to
>distinguish them.
>
>Of course, you don't have to provide such functions in practice, but
>you need a proper internal representation or some character
>recognition method.
>
>For example, you can use UJIS (EUC) code as an internal representation,
>because then there is no misinterpretation of HTML-significant
>characters, and provide code conversion when reading and writing the
>HTML file. Or you can make the character recognition module context
>sensitive, that is, a module that recognizes the script of each
>character while scanning strings.
>
>James Clark's new free SGML parser "sp" is a good implementation that
>uses a neutral internal representation. I heard that it scans EUC or
>SJIS and translates it into Unicode as an internal representation.
>
>You can reach this information by accessing
>http://www.jclark.com/
>
>There is also information on DSSSL and DSSSL Lite.
>
>---------------------------------------------------------------------------
>TSUCHIYA Satoshi | tsuchiya@sysrap.cs.fujitsu.co.jp | NIFTY-ID:GDC02435
Bob Jung +1 415 254-1900 x2788 fax +1 415 254-2601
Netscape Communications Corp. 650 Castro #500 Mtn View, CA 94041
From: bobj@mcom.com (Bob Jung)
Date: Wed, 14 Dec 1994 19:15:28 -0800
Subject: [www-mling,00087] Netscape & Unicode (was: Beware of the bureaucrats: the future of Multilingual WWW?)
Message-Id: <199412150315.TAA00955@neon.mcom.com>
Hi Dan,
>One other group that is working on what would become a de facto standard
>is Netscape Communications. They posted here recently, so they are very
>interested in multilingual WWW, but they don't seem to be planning Unicode
>support, which I find troubling.
Somehow Dan got the wrong impression about the direction we at Netscape
are pursuing. Supporting Unicode is definitely a direction for our
product.
However, today there is not much data on the web in Unicode, so we
are not planning to ***hastily rush*** any solution to the market.
On the other hand, we are being careful that the products we release
NOW will not prohibit us from supporting Unicode and other future
multilingual solutions.
We are very interested in working with the rest of the I18N community
to come up with the "correct" extensions to the standards.
Regards,
Bob Jung
Software Internationalization Manager
Netscape Communications Corporation
Bob Jung +1 415 254-1900 x2788 fax +1 415 254-2601
Netscape Communications Corp. 650 Castro #500 Mtn View, CA 94041
From: saty@skuld.ossi.com (Mr. Tsuchiya)
Date: Wed, 14 Dec 94 18:07:04 PST
Subject: [www-mling,00086] SJIS & HTML
Message-Id: <9412150207.AA02831@skuld.ossi.com.ossi.com>
Bob Jung writes:
> Don't most of the parsers scan byte-by-byte for the meta chars?
> So, isn't this a problem?
dank@knowledge.adventure.com writes:
> I expect any good multi-lingual web client will scan char-by-char rather
> than byte-by-byte.
> - Dan
Right.
In terms of SGML, an SGML parser includes an entity manager as a part
of it. The entity manager has responsibility for resolving the external
representation: it finds a file, scans it, then interprets each
character in the file (is it a Latin letter? is it a Japanese
character?) and passes the SGML character to the parse module.
An SGML character is a (fixed-length) bit combination that represents
the "character", and it has a unique character number which is
usually denoted by decimal digits.
If you want to treat Japanese characters in strict conformance to the
SGML standard (ISO 8879), you should assign appropriate bit
combinations to both the Latin alphabets and the Japanese characters to
distinguish them.
Of course, you don't have to provide such functions in practice, but
you need a proper internal representation or some character
recognition method.
For example, you can use UJIS (EUC) code as an internal representation,
because then there is no misinterpretation of HTML-significant
characters, and provide code conversion when reading and writing the
HTML file. Or you can make the character recognition module context
sensitive, that is, a module that recognizes the script of each
character while scanning strings.
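To see why EUC works for this (a sketch of the point, not code from
any actual parser):

    #include <stddef.h>

    /* In EUC, every byte of a Japanese character has its high bit set
     * (SS2 = 0x8E, SS3 = 0x8F, G1 bytes 0xA1-0xFE), so even a naive
     * byte-by-byte scan for an ASCII delimiter such as '<' (0x3C) can
     * never match in the middle of a character.  The same scan over
     * SJIS would be unsafe, since SJIS second bytes overlap ASCII. */
    const unsigned char *find_delim(const unsigned char *p, size_t n,
                                    unsigned char delim)
    {
        size_t i;
        for (i = 0; i < n; i++)
            if (p[i] == delim)   /* safe: delim < 0x80, EUC bytes >= 0x8E */
                return p + i;
        return NULL;
    }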
James Clark's new free SGML parser "sp" is a good implementation that
uses a neutral internal representation. I heard that it scans EUC or
SJIS and translates it into Unicode as an internal representation.
You can reach this information by accessing
http://www.jclark.com/
There is also information on DSSSL and DSSSL Lite.
---------------------------------------------------------------------------
TSUCHIYA Satoshi | tsuchiya@sysrap.cs.fujitsu.co.jp | NIFTY-ID:GDC02435
From: dank@knowledge.adventure.com
Date: Wed, 14 Dec 94 18:25 PST
Subject: [www-mling,00085] Re: Beware of the bureaucrats: the future of Multilingual WWW?
Message-Id: <9412141825.AA01840@knowledge.adventure.com>
Jim Fetters writes:
> What will happen to the future of HTML and SGML mark-up
> languages as standards are being revised without support for
> multiple foreign language character sets? ...
> So, I encourage everyone on this list to become involved in finding
> out the various proposed standards for HTML and SGML for WWW and
> actively lobbying for multilingual standards.
Hear, hear! Jim, can you tell us how to subscribe to that mailing list
you described?
One other group that is working on what would become a de facto standard
is Netscape Communications. They posted here recently, so they are very
interested in multilingual WWW, but they don't seem to be planning Unicode support,
which I find troubling.
IMHO, any good multilingual WWW standard should include support both for
all popular national character sets/encodings AND Unicode. Unicode is vital
to serving the needs of communities that don't have a big enough hacker
community or market presence to get their own national standards supported.
- Dan
From: "Jim A. Fetters" <fetters@enuxsa.eas.asu.edu>
Date: Wed, 14 Dec 1994 07:08:38 -0700 (MST)
Subject: [www-mling,00084] Beware of the bureaucrats: the future of Multilingual WWW?
Message-Id: <Pine.SOL.3.90.941214065243.27851A-100000@enuxsa.eas.asu.edu>
What will happen to the future of HTML and SGML mark-up
languages as standards are being revised without support for
multiple foreign language character sets?
Right now others are deciding the "fate" of the WWW without your
input. I am on a mailing list for DSSSL-Lite, which is a proposed
"future" WWW standard incorporating many SGML features. While the
specifications are ambitious and may add to the flexibility of
WWW markup languages, DSSSL-Lite nevertheless falls short of providing
multilingual capability. According to its proposers,
multilingual support will not be a part of their "global standard."
The so-called "standards" which are purported to be enacted fail
to look at who wants to communicate and what their needs will be.
How can something be a "standard" when it shuts out the entire Pacific
Rim from multimedia communication?
How arrogant to think that a few select people will control
standards for WWW mark-up languages, virtually silencing most of the
non-English-speaking world. As the WWW promises to bring us closer
to a higher level of global communication, it is we, the subscribers
of this list, who should be pushing, or rather demanding, support
for multilingual WWW mark-up languages.
So, I encourage everyone on this list to become involved in finding
out the various proposed standards for HTML and SGML for WWW and
actively lobbying for multilingual standards.
Jim Fetters
From: dank@knowledge.adventure.com
Date: Wed, 14 Dec 94 11:11 PST
Subject: [www-mling,00083] Re: SJIS & HTML
Message-Id: <9412141112.AA12800@knowledge.adventure.com>
Bob Jung asks:
> Don't most of the parsers scan byte-by-byte for the meta chars?
> So, isn't this a problem?
I expect any good multi-lingual web client will scan char-by-char rather
than byte-by-byte.
- Dan