From: "Dan Kegel" <dank@knowledge.adventure.com> Date: Sun, 25 Dec 1994 02:06:46 -0800 (PST) Subject: [www-mling,00141] re: Ken's questions Message-Id: <9412250206.AA14150@knowledge.adventure.com>
Ken asked some very good questions about two weeks ago, which I have only today had a chance to read. He writes:

> So I'd like to focus the argument here to 'code set', 'Unicode' and
> 'bi-directional'. (Can 'bi-directional' be also separated?)

I agree, and I do think that bi-directional can be separated. Bidirectionality, as far as I know, only involves the display of text, not its representation in memory or on the network. So I'd like to focus on just 'code set' and 'Unicode'.

> The achievement by now is only the description of 'character set' in the
> specification of html+. The description is not clear for me, but I suppose it
> says that the default code set can be specified by the MIME header. It's the
> first step but it's not enough for multi-lingual document.
> One problem is that the specification of SGML (may be) not allow the multiple
> code set in a document. (right ?) If so, we have to propose to change the
> spec of SGML, too.

No, you wouldn't need to extend SGML. Switching character sets can be considered a low-level, sub-character feature, beneath SGML or HTML.

> Frankly speaking, I don't know how we should start talking about this. But I
> think it's an idea that putting my questions here as the starting point.
>
> 1. Can Unicode be the single code set for the WWW document?

We should plan to support WWW documents which are all Unicode, all ISO2022, all shift-JIS, etc. No need to mandate Unicode.

> 2. Is UTF better than canonical Unicode for WWW document?

For western languages, yes; eastern languages would prefer naked Unicode. I think we should support both.

> 3. Is ISO2022 suitable/enough for multi-lingual WWW document?

It does a lot. Let's support it as one encoding.

> 4. Is 'bi-directional' independent from the code set?

Yes.

> 5. Is it acceptable that converting existing (many) pages
>    to conform new standard?

Yes, but let existing pages remain as they are.
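To make the UTF vs. naked Unicode trade-off concrete, here is a little byte-counting sketch. It uses modern Python codec names as an assumption: "naked Unicode" is rendered as UTF-16 (big-endian, no BOM) and "UTF" as what is today called UTF-8. ASCII text doubles in size as naked Unicode but stays one byte per character in UTF; Japanese text is two bytes per character naked but three bytes per character in UTF.

```python
# Byte counts for the same text as naked Unicode (UTF-16-BE, no BOM)
# versus UTF encoding (modern UTF-8).
western = "hello"
eastern = "\u65e5\u672c\u8a9e"   # "Nihongo", three kanji

print(len(western.encode("utf-16-be")), len(western.encode("utf-8")))  # 10 5
print(len(eastern.encode("utf-16-be")), len(eastern.encode("utf-8")))  # 6 9
```

That is why western users lean toward UTF and eastern users toward naked Unicode.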
We should encourage servers to add a header describing the codeset used, but the only penalty for not adding the header should be that the user has to select a codeset manually.

> 6. Should every browser support every code set in standard?

Let's restate that: every browser should support every code set in the standard, but need only display characters that are in the default system font. That is, browsers should not be required to use special Unicode or JIS fonts in order to conform. Lynx in particular would have a hard time with that.

Given that there will probably be ten codesets to support, how can we expect browsers to handle them all? Simple: we specify that, if the native system can't handle the codeset in question, the data be converted to the native codeset. If no direct conversion is available, the data should first be converted to Unicode, then to the native codeset. It should be permissible to always go thru Unicode. We can refer implementors directly to the proper tables at unicode.org.

However, Asian users may prefer that we treat the codeset of origin as a language hint, and display Japanese text with different fonts than Chinese text. This could be done internally to the browser. It should not be mandated, since the majority of users may be satisfied without it, and those who need it can get browsers that do it. It is sufficient that we avoid mandating converting all characters into Unicode; that would be an insult to some people.

> 7. What should the browser do if it cannot display the character
>    in the document correctly?

Replace it with a special symbol. If large areas are not displayable, they should be replaced with a summary (e.g. "10 lines could not be displayed.")

> 8. Should the code conversion is done by the server?

Hmm, can we think of a situation where this would be better than client-side conversion? I don't know yet.

> 9. The format of the code set tag.

The big question!
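The "always permissible to go thru Unicode" fallback is easy to sketch. This is a minimal illustration in modern Python, where the decode/encode tables play the role of the conversion tables at unicode.org; the codec names (`shift_jis`, `euc_jp`) are today's Python spellings, not anything mandated here:

```python
# Fallback conversion with no direct table: source codeset -> Unicode
# -> native codeset.
def convert_via_unicode(data: bytes, src: str, dst: str) -> bytes:
    text = data.decode(src)   # source codeset -> Unicode
    return text.encode(dst)   # Unicode -> native codeset

# e.g. Shift-JIS data arriving at a browser whose native codeset is EUC-JP:
sjis = "\u65e5\u672c".encode("shift_jis")
eucjp = convert_via_unicode(sjis, "shift_jis", "euc_jp")
print(eucjp == "\u65e5\u672c".encode("euc_jp"))  # True
```

A browser would only need one decoder and one encoder per codeset, rather than a direct table for every pair.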
For backward compatibility, all transactions would be Latin-1 by default (or whatever codeset the user selects as the default). I'd like to propose the following ad hoc scheme for detecting the codeset of a stream of bytes:

1. If a naked Unicode byte order mark is present as the first two bytes of the stream, the entire stream is taken to be naked (16 bits per char) Unicode.
2. If a UTF-encoded Unicode byte order mark is present as the first three bytes of the stream, the entire stream is taken to be UTF-encoded Unicode.
3. If an ISO2022-style shift sequence is present as the first two (three?) bytes of the stream, the entire stream is taken to be ISO2022 encoded.
4. Otherwise, if the default code set is ISO2022-compatible, and an ISO2022-style shift sequence is present anywhere in the data stream, the entire stream is taken to be ISO2022 encoded.

That ought to handle all the data that's out there now. Can anyone see data on their server that would cause a problem with this scheme? Conforming servers would announce their codeset according to these rules; the documents stored internally on the server needn't conform to them.

Whew. Enough for tonight. I look forward to your comments.

Dan Kegel
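P.S. A minimal sketch of the four detection rules above, in order, in Python. The BOM byte values follow from the proposal itself (U+FEFF is two bytes naked, three bytes UTF-encoded); the ISO2022 check looks for an ESC followed by '$' or '(' as an illustrative shift sequence, since the exact sequences vary by ISO2022 profile, and the label strings are just placeholders:

```python
ESC = 0x1B  # start of an ISO2022-style shift sequence

def sniff_codeset(data: bytes, default_is_iso2022: bool = False) -> str:
    if data[:2] in (b"\xfe\xff", b"\xff\xfe"):   # rule 1: naked Unicode BOM
        return "unicode"
    if data[:3] == b"\xef\xbb\xbf":              # rule 2: UTF-encoded BOM
        return "utf"
    if len(data) >= 2 and data[0] == ESC and data[1] in b"$(":  # rule 3
        return "iso2022"
    if default_is_iso2022 and ESC in data:       # rule 4: shift anywhere
        return "iso2022"
    return "default"                             # fall back to Latin-1 etc.

print(sniff_codeset(b"\xfe\xff\x00h"))     # unicode
print(sniff_codeset(b"\x1b$B..."))         # iso2022
print(sniff_codeset(b"plain ascii text"))  # default
```

Note that rule 4 only fires when the user's default codeset is ISO2022-compatible, so plain Latin-1 pages containing a stray ESC are not misdetected.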