Closed Bug 13393 Opened 25 years ago Closed 24 years ago

Implement Accept-Charset Header according to HTTP/1.1

Categories

(Core :: Networking: HTTP, defect, P3)


Tracking


VERIFIED FIXED
mozilla0.9

People

(Reporter: momoi, Assigned: darin.moz)

References


Details

(Keywords: intl)

Attachments

(3 files)

In 5.0, there is currently no Accept-Charset header entry in our HTTP request headers. We should implement it as we did in 4.x. Currently DSGW 4.x/3.x requires an Accept-Charset header from a client.

We do need to revise the way this was implemented in 4.x. There we had something like this, and it was hard-coded: primary_charset, *, utf-8 and L10n had to localize this value for Win. Mac and Unix simply shipped with Latin 1 values, which was not correct. But given that there was no easy way to localize the values, this was understandable.

Under 5.0, we should do something like the following, honoring HTTP/1.1: primary_charset, utf-8, *;q=0.8 The idea is to supply the "primary_charset" based on the user's selection of the default language as described in the 5.0 Intl UI proposal document: http://rocknroll/users/momoi/publish/seamonkey/50intlui.html This way, L10n need not be involved at all in setting this manually. As to the "q" values, we should just pick an arbitrary value (less than 0) for the 3rd argument, "*". Our aim should be to give servers choices to pick from: primary_charset or UTF-8, or any other charset if they cannot provide either of the 2 main choices.

The value for the 4.x prefs.js line looks like this: user_pref("intl.accept_charsets", "iso-8859-1,utf-8,*;q=0.8");
Correction: "..As to the "q" values, we should just pick an arbitrary value (less than 0).." I meant arbitrary value (less than 1).
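The proposed rule (the user's primary charset first, then utf-8, then "*" with a q-value below 1) can be sketched as follows. This is an illustrative sketch only; build_accept_charset is a hypothetical helper, not Mozilla's actual implementation:

```python
def build_accept_charset(primary):
    """Sketch of the rule proposed above: primary charset, then utf-8,
    then '*' with an arbitrary q below 1. Hypothetical helper."""
    if primary.lower() == "utf-8":
        # avoid listing utf-8 twice when it is already the primary charset
        return "utf-8,*;q=0.8"
    return "%s,utf-8,*;q=0.8" % primary

print(build_accept_charset("shift_jis"))  # shift_jis,utf-8,*;q=0.8
print(build_accept_charset("UTF-8"))      # utf-8,*;q=0.8
```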
Assignee: ftang → warren
Warren, Necko need to implement the back end of this. You just need to pick up the pref value and our group will do (or find someone to do) the pref UI part.
The LDAP gateway depends on this.
Status: NEW → RESOLVED
Closed: 25 years ago
Resolution: --- → DUPLICATE
*** This bug has been marked as a duplicate of 12790 ***
Frank, we need to make sure that our part will be done so that proper values are picked up when #12790 is fixed. Should we open another bug for that?
Warren, Bug 12790 talks of only one of the "accept" headers and doesn't refer to the "Accept-Charset" header specifically, though it is quoted in the data sample from 4.61. Does the fix there apply to all Accept- headers?
QA Contact: teruko → momoi
Status: RESOLVED → REOPENED
Status: REOPENED → RESOLVED
Closed: 25 years ago
Status: RESOLVED → REOPENED
** Checked with 9/16/99 Win32 build ** I put in 2 prefs.js lines like this: user_pref("intl.accept_charsets", "shift_jis,utf-8,*;q=0.8"); user_pref("intl.accept_languages", "en"); then accessed: http://kaze:8000/bin/echo.cgi and found that we are still not sending either the Accept-Language or the Accept-Charset header. Someone has to make this work. Frank, is this yours now? Or is it still warren's?
Until we know what needs to be done to get the right results, I'm re-opening this bug.
Resolution: DUPLICATE → ---
Assignee: warren → gagan
Status: REOPENED → NEW
Back to Gagan...
Status: NEW → ASSIGNED
Target Milestone: M12
Moving Assignee from gagan to warren since he is away.
Moving what's not done for M12 to M13.
Assignee: warren → gagan
Back to Gagan for M13.
Status: NEW → RESOLVED
Closed: 25 years ago
Resolution: --- → WORKSFORME
From my discussions with Erik, this is more debatable and hence I am closing this for now. Apparently IE doesn't send a charset either and works just fine with directory server. If you feel that this should still be sent then lets discuss this on the newsgroup before opening this bug here again.
Hi, I filed this bug for the convenience of our own DSGW, which checks for Accept-Charset to see if it can send UTF-8. When it sees the header we came up with, it then sends UTF-8. Here's a comment on this issue from a DS developer, noriko@netscape.com.

> >> Thanks for the explanation. We understand UTF-8 is now more
> >> common. We can change DSGW in the next version (5.0) not to check
> >> the Accept-Charset. But the DSGW already in the market is
> >> expecting the variable... So, if Communicator 5.0 stops sending
> >> it, the 4.X/3.X DSGW would get screwed up. I'd like to avoid the
> >> risk.
Actually, the word 'convenience' is wrong. It is so that we 'avoid' screwing up our own DS Gateway, which is used in web-based access to DS data. I agree that from an Internet protocol level discussion, this feature is debatable, but there is also a practical issue.
I'll send you guys one of the msgs I exchanged with DS people.
MSIE does not emit Accept-Charset. How does DSGW handle this situation?
erik, I looked at the charset handling code noriko sent me for DSGW 3.x/4.x. It makes a special allowance for MS IE4. It doesn't look like it does so for IE5, however. DSGW decides on the charset to use based on Accept-Language and Accept-Charset. If there is no Accept-Charset info, it will default to a charset appropriate for the Accept-Language.

Take my own server, polyglot (DSGW 3.x). It can serve both Japanese and English interface pages based on Accept-Language. The data contained there, however, has both Japanese and Latin 1 accented characters. Also, the search root, o="Netscape", is in Japanese. I tried the following with the current Mozilla and IE5 with accept-lang set to ja or en.

Mozilla w/ ja:
1. Can display Japanese names but not Latin 1 accents (because DSGW does not use UTF-8 but the Shift_JIS charset).

Mozilla w/ en:
2. Cannot find a single entry, because the search root o="Netscape" is in Japanese but the charset used is ISO-8859-1 in this case, so the ldap URL simply fails to match.

MS IE5 w/ ja:
3. Can display Japanese names but not Latin 1 accents (same reason as 1).

MS IE5 w/ en:
4. It even refuses to display the first page of the gateway, because it contains data from o="Netscape" in Japanese but the charset sent is ISO-8859-1.

In summary, not sending Accept-Charset, and thus not enabling DSGW to send data in UTF-8, spells disaster for those DSGW 3.x/4.x users who may have 1) multilingual data, and/or 2) ldap attribute names in non-ASCII. I am very much inclined to re-open this bug for the above reasons. If you don't want me to, please provide arguments before too long.
Needless to say, 4.72 I'm using now had none of the problems mentioned above.
The sniffer script DSGW 3.x/4.x uses has a special allowance for IE4 and so, though I haven't tried it, IE4 probably gets UTF-8 data from DSGW and thus avoids these problems.
My suggestion is to update the sniffer script for DSGW's next version. If the sniffer script is able to deal with MSIE4, then it should be able to deal with Mozilla 5. Also, current DSGW customers can be asked to update their script, which hopefully is a text file. MSIE5 does not emit Accept-Charset, and MSIE5 has a large market share. If DSGW is interested in supporting a large fraction of Internet users, DSGW will have to make changes to their own releases and to their customers' installations. Mozilla is trying to reduce the amount of stuff it sends out with EVERY HTTP request. Accept-Charset has limited value. Mozilla needs to weigh all of these factors and make a decision. It's not my decision to make, but my opinion is that Mozilla 5 should refrain from emitting Accept-Charset for the above reasons.
I'm reasonably sure that what you suggest are all doable. I have no idea, however, how practical that is in this situation or how much extra work that would entail. I hear occasionally from Russian users that their sites use an accept-charset sniffer. I guess in languages where multiple charsets are competing, accept-charset would be nice, but again I don't know how sorely this is needed for such a case. I think I've stated the reasons for re-opening the bug. Other opinions are welcome.
I've talked to noriko further about this, and it looks like the script is part of the C code and cannot be changed without patching the source itself. This will fall into sustaining engineering's area. There is apparently a less-than-perfect but nonetheless workable way to turn off accept-charset sniffing and send UTF-8 data, however. This will be a tech support issue. I don't necessarily buy an argument that we are sending too many HTTP headers -- I compared IE5 and Comm 4.72 and the difference is only 1: IE5 does not send out accept-charset. But I can buy an argument that we should not send out a header that is not important or sorely needed. Accept-Charset might at this point in time fall into that category. The only other point I would like to pursue is whether others in the net community agree with this assessment. It won't hurt to ask before verifying the resolution. And that is what I will do now.
I've publicly asked net people about this feature, and no one expressed concern about this feature not being in Mozilla. The question was asked some time ago and I now feel that we have waited long enough for a reaction. I think the resolution should be wontfix rather than worksforme, however.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Changing the resolution to WontFix.
Status: REOPENED → RESOLVED
Closed: 25 years ago
Resolution: --- → WONTFIX
Verified as Wontfix.
Status: RESOLVED → VERIFIED
I have read through all the arguments in bug 13393 and would like to weigh in with a few:

>>>>> From Erik van der Poel 2000-01-22 13:56
ep> MSIE5 does not emit Accept-Charset, and MSIE5 has a large market
ep> share. If DSGW is interested in supporting a large fraction of
ep> Internet users, DSGW will have to make changes to their own
ep> releases and to their customers' installations.

As far as I remember, MSIE5 sends HTTP/1.1 and is thus required to understand UTF-8 (cf. chapter 14.2, HTTP/1.1). If a browser understands UTF-8 and everybody knows this because it is HTTP/1.1, it can refrain from sending this header; it would be redundant. But even in this case it would still be polite to send Accept-Charset, because an HTTP/1.0 proxy will be required to downgrade a request to HTTP/1.0, and thus the server can't find out that the browser behind the proxy is HTTP/1.1.

ep> Mozilla is trying to reduce the amount of stuff it sends out with
ep> EVERY HTTP request. Accept-Charset has limited value. Mozilla needs
ep> to weigh all of these factors and make a decision. It's not my
ep> decision to make, but my opinion is that Mozilla 5 should refrain
ep> from emitting Accept-Charset for the above reasons.

Erik doesn't say why. I honour the decision to send terse headers, but it is a wrong decision to say, let's just follow IE5. As long as we do not have the arguments on the table why they decided their way, we must find them out ourselves.

>>>>> Additional Comments From Katsuhiko Momoi 2000-01-22 14:22
km> I'm reasonably sure that what you suggest are all doable. I have
km> no idea, however, how practical that is in this situation or how
km> much extra work that would entail. I hear occasionally from
km> Russian users that their sites use accept-charset sniffer. I guess
km> in languages where multiple charsets are competing, accept-charset
km> would be nice but again I don't know how sorely this is needed for
km> such a case.
I'm not speaking for languages where multiple charsets are competing; I'm speaking from the perspective of an i18n'd server, of which I have implemented a few. An i18n'd server typically works with Unicode internally and converts on request. The server can be implemented in a language-ignorant way; it sends many languages. Talking about language here somehow muddies the waters.

If Mozilla doesn't send Accept-Charset, the server side must convert to iso-8859-1, because this was the standard charset in HTTP/1.0. Period. So my revised suggestion of how to form this header would be: Accept-Charset: utf-8,*;q=0.8 and leave the primary charset out of the equation.

I see no reason why the primary charset should be announced to servers at all. Mozilla can convert to it anyway. And if the conversion would be lossy, it would be wise not to convert to it. But that's beyond the scope of this bugid. -- andreas
Status: VERIFIED → REOPENED
Resolution: WONTFIX → ---
Andreas, the LDAP server case I was referring to above is one example of your i18n'ed server. It stores all the data in UTF-8. It then sends that data to a client in UTF-8, or in an encoding appropriate for the language of the client in case the client does not explicitly say what charset it can accept. (The question of language does come into play for certain types of data.)
*** Bug 48361 has been marked as a duplicate of this bug. ***
"in an encoding appropriate for the language of the client" is a very vague concept. What if the document is not in the language of the client and is not displayable in the encoding appropriate for the language of the client? Note that the language of the client can be a set of languages too.
Andreas, there are many different ways to make use of Accept-Charset. Suppose you have a directory server deployed in a predominantly Japanese environment. The LDAP protocol's default charset is UTF-8, so all the data would be in UTF-8. Now if a Japanese client accesses it and says that its primary charset is Shift_JIS but UTF-8 is OK, then the server simply sends UTF-8. If not, it sends Shift_JIS-encoded Japanese data. This kind of use is what we have in the case described above. Then there are the kinds of cases you describe: you may have data in many languages on a single page, which can be encoded in ISO-8859-1 or UTF-8. The notion of a primary charset is quite useful in some of these cases. Note also that ISO-8859-1 is always assumed even if it is not explicitly listed.
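The DSGW-style decision just described -- prefer UTF-8 when the client lists it, otherwise fall back to a language-appropriate charset -- can be sketched roughly like this. This is a simplified illustration with hypothetical function names, not DSGW's actual code:

```python
def parse_accept_charset(header):
    """Parse an Accept-Charset value into {charset: q} per HTTP/1.1.
    Charsets listed without a q parameter default to q=1."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        q = 1.0
        for param in parts[1:]:
            if param.lower().startswith("q="):
                q = float(param[2:])
        prefs[parts[0].lower()] = q
    return prefs

def choose_charset(header, available, fallback):
    """Pick the highest-q charset the server can produce; '*' covers
    anything not listed explicitly. Use the fallback when nothing
    in the header makes an available charset acceptable."""
    prefs = parse_accept_charset(header)
    best, best_q = fallback, 0.0
    for cs in available:
        q = prefs.get(cs, prefs.get("*", 0.0))
        if q > best_q:
            best, best_q = cs, q
    return best

# A client whose primary charset is Shift_JIS but that accepts UTF-8:
# a server holding UTF-8 data can send UTF-8 without converting.
print(choose_charset("shift_jis,utf-8,*;q=0.8", ["utf-8"], "iso-8859-1"))
```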
Katsuhiko, I'd like to structure the things to discuss; not all of them need to be addressed or resolved now.

1. Should Mozilla send an Accept-Charset header that contains at least utf-8 and "*"? I believe my arguments above prove this is necessary, and Mozilla should have it, at least for the next few years during which the rest of the world is not utf-8 safe.

2. Should Mozilla have the notion of a primary charset? I did not question this and I still believe it is useful for Mozilla. I see the main usefulness when it comes to storing content on disk, but also when it comes to browsing sites that do not declare their charset and heuristics are needed to determine it. But this is an entirely different problem domain, so let's not get carried away with these problems.

3. Should Mozilla include the primary charset in the Accept-Charset header? I see no need to. Mozilla can most probably read any charset, and this is expressed with the star. If Mozilla has no bugs in the conversion engine, it makes no difference for the user if he gets a LATIN SMALL LETTER C WITH CEDILLA as u+00E7 in utf-8 or as 0xE7 in iso-8859-1. Or to try an equivalent: 0xC4 in Shift_JIS is a HALFWIDTH KATAKANA LETTER TO, and u+FF84 is the same thing. No need to express a preference of one over the other.

4. Does the user need to be able to configure the Accept-Charset header? I see no reason to. Same argument as in (3) above.

5. Does Mozilla need to consider the set of languages the user has chosen in the language preferences when sending the Accept-Charset header? I'd say, definitely not.

Among the 5 topics, only #1 needs to be addressed.
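The equivalence claimed in point 3 can be checked directly with a codec: the single byte 0xC4 in Shift_JIS and the code point U+FF84 both denote HALFWIDTH KATAKANA LETTER TO, just spelled in two different encodings.

```python
# HALFWIDTH KATAKANA LETTER TO: single byte 0xC4 in Shift_JIS,
# code point U+FF84 in Unicode (3 bytes in UTF-8) -- same character.
ch = "\uff84"
print(ch.encode("shift_jis"))             # b'\xc4'
print(ch.encode("utf-8"))                 # b'\xef\xbe\x84'
print(b"\xc4".decode("shift_jis") == ch)  # True
```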
My response to the issues raised by Andreas:

#1: Agreed.

#2: We already have that expressed in the Navigator default charset in the Preferences. (This is a client-side preference setting and has no interactive aspect with servers.)

#3: In an ideal world, this would be true. But just like your argument in #1, i.e. the world is not UTF-8 safe yet, not everyone would tag their Unicode documents with a lang tag indicating what language they are in. And Mozilla has a dependency on language for deciding which font glyphs to use. For example, Unicode CJK ideographs are not necessarily rendered the same from language to language. The same code point may lead to different font glyphs depending on what language it is. Unless everyone uses a lang tag, I may end up seeing a Japanese document with some Chinese glyphs. And I definitely don't want that! (See how fonts are set in the preference dialog -- according to language. But if language info is not available in the docs, we do our best by looking at the charset info -- a charset is a good secondary determining factor for some languages, e.g. Chinese, Japanese, Korean, etc. Thus, the notion of a primary charset is still useful in this situation.)

#4: The user does not have to, as long as the localization process can take care of it.

#5: Agreed. But we may use the Navigator default charset for this.
Thank you for the background info for #3--very interesting, I see more light now and agree with you.
Target Milestone: M13 → Future
*** Bug 60496 has been marked as a duplicate of this bug. ***
There is a patch attached to bug 60496, by the way.
Added "patch" keyword.
Keywords: patch
Thanks a lot for the patch! There's some purely cosmetic thing left. When the default character set chosen via Preferences/Languages is "Unicode (UTF-8)", then the resulting Accept-Charset header becomes: Accept-Charset: UTF-8, utf-8; q=0.667, *; q=0.667 which seemingly is legal but redundant.
Koenig: whoops, you're right... the patch is designed to avoid the duplicate "utf-8", but it doesn't check for case. Change line 116 of the patch from:
+ if (PL_strstr(acceptable, "utf-8") == NULL) {
to:
+ if (PL_strcasestr(acceptable, "utf-8") == NULL) {
and that should do the trick.
Also, while the Language Preference screen won't let you do it, the above patch will allow a comma-separated list of character sets/encodings in intl.charset.default, which you can set by manually editing your prefs.js. Nothing else seems to use intl.charset.default (true?), but if something else isn't expecting comma-delimited tokens in that preference, this could get you into trouble.
intl.charset.default must be a single-item entry. (No comma-delimited list should be in it -- that defeats the purpose of this pref!) It is your default fallback encoding for browsing in case HTTP, the HTTP Meta-Equiv, or auto-detection cannot give you a document charset. For Composer, it is used as the default encoding for a new document. This value should be set by a localizer to be suitable for each locale. It has a UI also: Edit | Prefs | Navigator | Languages | Character Coding.
Understood. The above patch is still o.k., because while _it_ can handle a comma-delimited list, it doesn't add a comma list to the pref itself -- just a little bit of (unneeded for now, until the patch is changed to use a preference other than intl.charset.default) robustness -- access to intl.charset.default is read-only.
http bugs to "Networking::HTTP"
Assignee: gagan → darin
Status: REOPENED → NEW
Component: Internationalization → Networking: HTTP
QA Contact: momoi → tever
Target Milestone: Future → M19
Keywords: intl
Depends on: 65092
No longer depends on: 65092
Blocks: 65092
nominating for moz 0.9
Target Milestone: --- → mozilla0.9
Looks good. r=darin
adding keyword nsbeta1
Keywords: nsbeta1
Fix checked in.
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
You can check what mozilla sends at: http://gemal.dk/browserspy/accept.cgi
Henrik Gemal wrote: > You can check what mozilla sends at: > http://gemal.dk/browserspy/accept.cgi or you can use http://www.mozilla.gr.jp:4321/ which is step B20 of the smoketests at http://www.mozilla.org/quality/smoketests/
verified
Status: RESOLVED → VERIFIED