Closed
Bug 13393
Opened 25 years ago
Closed 24 years ago
Implement Accept-Charset Header according to HTTP/1.1
Categories
(Core :: Networking: HTTP, defect, P3)
Tracking
()
VERIFIED
FIXED
mozilla0.9
People
(Reporter: momoi, Assigned: darin.moz)
References
()
Details
(Keywords: intl)
Attachments
(3 files)
(deleted), patch
(deleted), patch
(deleted), patch
In 5.0, there is currently no Accept-Charset header entry in our HTTP
request headers. We should implement it as we did in 4.x.
Currently DSGW4.x/3.x requires Accept-Charset header from a client.
We do need to revise the way this was implemented in 4.x.
There we had something like this and it was hard-coded:
primary_charset, *, utf-8
and L10n had to localize this value for Win. Mac and Unix simply shipped
with Latin 1 values, which was not correct. But given that there was
no easy way to localize the values, this was understandable.
Under 5.0, we should do something like the following honoring HTTP/1.1:
primary_charset, utf-8, *;q=0.8
The idea is to supply the "primary_charset" based on the
user's selection of the default language as described in the
5.0 Intl UI proposal document:
http://rocknroll/users/momoi/publish/seamonkey/50intlui.html
This way, L10n need not be involved at all in setting this
manually.
As to the "q" values, we should just pick an arbitrary value (less
than 0) for the 3rd argument, "*". Our aim should be to give
servers choices to pick from Primary_charset or UTF-8, or any
other charset if they cannot provide either of the 2 main
choices.
The value for the 4.x prefs.js line looks like this:
user_pref("intl.accept_charsets", "iso-8859-1,utf-8,*;q=0.8");
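A minimal sketch of the header construction proposed above, written in Python purely for illustration (the real implementation would live in the networking code). The function name `build_accept_charset` and its parameters are hypothetical, and the q-value check assumes the intended range is strictly between 0 and 1.

```python
def build_accept_charset(primary_charset: str, wildcard_q: float = 0.8) -> str:
    """Return a header value of the form 'primary_charset,utf-8,*;q=0.8'.

    Hypothetical helper: primary_charset stands in for the charset derived
    from the user's default language selection.
    """
    if not 0 < wildcard_q < 1:
        raise ValueError("wildcard q-value must be strictly between 0 and 1")
    # %g drops trailing zeros, so 0.8 is rendered as '0.8'.
    return "%s,utf-8,*;q=%g" % (primary_charset, wildcard_q)


header_value = build_accept_charset("iso-8859-1")
```

With "iso-8859-1" as the primary charset, this yields the same string as the 4.x prefs.js value quoted above.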
Reporter | ||
Comment 1•25 years ago
|
||
Correction:
"..As to the "q" values, we should just pick an arbitrary value (less
than 0).."
I meant arbitrary value (less than 1).
Updated•25 years ago
|
Assignee: ftang → warren
Comment 2•25 years ago
|
||
Warren, Necko needs to implement the back end of this. You just need to pick up
the pref value and our group will do (or find someone to do) the pref UI part.
Comment 3•25 years ago
|
||
The LDAP gateway depends on this.
Updated•25 years ago
|
Status: NEW → RESOLVED
Closed: 25 years ago
Resolution: --- → DUPLICATE
Reporter | ||
Comment 5•25 years ago
|
||
Frank, we need to make sure that our part will be done
so that proper values are picked up when #12790
is fixed. Should we open another bug for that?
Reporter | ||
Comment 6•25 years ago
|
||
Warren, Bug 12790 talks of only one of the "accept" headers and
doesn't refer to "Accept-charset" header specifically though it is
quoted in the data sample from 4.61. Does the fix there apply to all
Accept-headers?
Reporter | ||
Updated•25 years ago
|
Status: RESOLVED → REOPENED
Reporter | ||
Updated•25 years ago
|
Status: REOPENED → RESOLVED
Closed: 25 years ago → 25 years ago
Reporter | ||
Updated•25 years ago
|
Status: RESOLVED → REOPENED
Reporter | ||
Comment 7•25 years ago
|
||
** Checked with 9/16/99 Win32 build **
I put in 2 prefs.js lines like this:
user_pref("intl.accept_charsets", "shift_jis,utf-8,*;q=0.8");
user_pref("intl.accept_languages", "en");
then accessed:
http://kaze:8000/bin/echo.cgi
and found that we are still not sending either the Accept-Language
or the Accept-Charset header.
Someone has to make this work. Frank, is this yours now? Or is it still
warren's?
Reporter | ||
Comment 8•25 years ago
|
||
Until we work out what needs to be done to get the right results,
I'm re-opening this bug.
Reporter | ||
Updated•25 years ago
|
Resolution: DUPLICATE → ---
Updated•25 years ago
|
Assignee: warren → gagan
Status: REOPENED → NEW
Comment 9•25 years ago
|
||
Back to Gagan...
Comment 10•25 years ago
|
||
Moving Assignee from gagan to warren since he is away.
Comment 11•25 years ago
|
||
Moving what's not done for M12 to M13.
Updated•25 years ago
|
Assignee: warren → gagan
Comment 12•25 years ago
|
||
Back to Gagan for M13.
Status: NEW → RESOLVED
Closed: 25 years ago → 25 years ago
Resolution: --- → WORKSFORME
Comment 13•25 years ago
|
||
From my discussions with Erik, this is more debatable and hence I am closing
this for now. Apparently IE doesn't send a charset either and works just fine
with directory server. If you feel that this should still be sent then let's
discuss this on the newsgroup before opening this bug here again.
Reporter | ||
Comment 14•25 years ago
|
||
Hi, I filed this bug for the convenience of our own DSGW
which checks for accept-charset to see if it can send UTF-8. When
it sees the header we came up with, it then sends UTF-8.
Here's a comment on this issue from a DS developer,
noriko@netscape.com.
> >> Thanks for the explanation. We understand UTF-8 is now more
> >> common. We can change DSGW in the next version (5.0) not to check
> >> the Accept-Charset. But the DSGW already in the market is
> >> expecting the variable... So, if Communicator 5.0 stops sending
> >> it, the 4.X/3.X DSGW would get screwed up. I'd like to avoid the
> >> risk.
> >>
Reporter | ||
Comment 15•25 years ago
|
||
Actually, the word 'convenience' is wrong. It is so that we
'avoid' screwing up our own DS Gateway which is used in
web-based access to DS data.
I agree that from an Internet protocol level discussion, this feature
is debatable, but there is also a practical issue.
Reporter | ||
Comment 16•25 years ago
|
||
I'll send you guys one of the msgs I exchanged with DS people.
Comment 17•25 years ago
|
||
MSIE does not emit Accept-Charset. How does DSGW handle this situation?
Reporter | ||
Comment 18•25 years ago
|
||
erik, I looked at the charset handling code noriko sent me on
DSGW3.x/4.x. It makes special allowance for MS IE4. It doesn't
look like it does so for IE 5, however.
DSGW seems to decide on charset to use based on Accept-Language
and Accept-charset. If there is no Accept-charset info, it will
default to a charset appropriate for the Accept-Language.
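The fallback described here could be sketched roughly as follows. This is not DSGW's actual code, which is not shown in this bug; the language-to-charset table and the function name are hypothetical, and the sketch is in Python only for readability.

```python
# Hypothetical mapping from a language tag to a legacy default charset.
LANGUAGE_DEFAULT_CHARSET = {"ja": "Shift_JIS", "en": "ISO-8859-1"}


def choose_response_charset(accept_language, accept_charset=None):
    """Pick UTF-8 if the client advertises it, else fall back by language."""
    if accept_charset:
        # Drop q-values and compare charset names case-insensitively.
        offered = [item.split(";")[0].strip().lower()
                   for item in accept_charset.split(",")]
        if "utf-8" in offered:
            return "UTF-8"
    # No usable Accept-Charset: use the first listed language instead.
    primary_language = accept_language.split(",")[0].split(";")[0].strip().lower()
    return LANGUAGE_DEFAULT_CHARSET.get(primary_language, "ISO-8859-1")
```

Under these assumptions, a client sending only an Accept-Language of "ja" gets Shift_JIS, while one that also sends "shift_jis,utf-8,*;q=0.8" gets UTF-8.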
Take my own server, polyglot (DSGW 3.x). It can serve both
Japanese and English interface pages based on Accept-Language.
The data contained there, however, has both Japanese and
Latin 1 accented characters. Also, the search root, the o="Netscape"
part, is in Japanese.
I tried the following with the current Mozilla and IE5 with
accept-lang set to ja or en.
Mozilla w/ ja:
1. Can display Japanese names but not Latin 1 accents (because DSGW
does not use UTF-8 but Shift_JIS charset.)
Mozilla w/ en:
2. Cannot find a single entry because the search root o="Netscape"
is in Japanese but charset used is ISO-8859-1 in this case,
and thus ldap url simply fails to match.
MS IE 5 w/ja:
3. Can display Japanese names but not Latin 1 accents (because DSGW
does not use UTF-8 but Shift_JIS charset.)
MS IE 5 w/en:
4. It even refuses to display the first page in the gateway because
it contains data from "o="Netscape"" in Japanese but the charset
sent in is ISO-8859-1.
In summary, failing to send accept-charset, which is what enables DSGW to
send data in UTF-8, spells disaster for those DSGW 3.x/4.x users
who may have 1) multilingual data, and/or 2) ldap attribute names in
non-ASCII.
I am very much inclined to re-open this bug for the above reasons.
If you don't want me to, please provide arguments before too
long.
Reporter | ||
Comment 19•25 years ago
|
||
Needless to say, 4.72 I'm using now had none of the problems
mentioned above.
Reporter | ||
Comment 20•25 years ago
|
||
The sniffer script DSGW 3.x/4.x uses has a special allowance
for IE4 and so, though I haven't tried it, IE4 probably gets
UTF-8 data from DSGW and thus avoids these problems.
Comment 21•25 years ago
|
||
My suggestion is to update the sniffer script for DSGW's next version. If the
sniffer script is able to deal with MSIE4, then it should be able to deal with
Mozilla 5. Also, current DSGW customers can be asked to update their script,
which hopefully is a text file.
MSIE5 does not emit Accept-Charset, and MSIE5 has a large market share. If DSGW
is interested in supporting a large fraction of Internet users, DSGW will have
to make changes to their own releases and to their customers' installations.
Mozilla is trying to reduce the amount of stuff it sends out with EVERY HTTP
request. Accept-Charset has limited value. Mozilla needs to weigh all of these
factors and make a decision. It's not my decision to make, but my opinion is
that Mozilla 5 should refrain from emitting Accept-Charset for the above
reasons.
Reporter | ||
Comment 22•25 years ago
|
||
I'm reasonably sure that what you suggest are all doable.
I have no idea, however, how practical that is in this
situation or how much extra work that would entail.
I hear occasionally from Russian users that their sites
use accept-charset sniffer. I guess in languages where multiple
charsets are competing, accept-charset would be nice but again
I don't know how sorely this is needed for such a case.
I think I've stated the reasons for re-opening the bug. Other
opinions are welcome.
Reporter | ||
Comment 23•25 years ago
|
||
I've talked to noriko further about this and it looks like
the script is part of C code and cannot be changed without
patching the source itself. This will fall into the sustaining
engineering's area. There is apparently a less-than-perfect but
nonetheless workable way to turn off accept-charset sniffing and
send UTF-8 data, however. This will be a tech support issue.
I don't necessarily buy an argument that we are sending too many
HTTP headers -- I compared IE5 and Comm4.72 and the difference is only
one header: IE5 does not send out accept-charset.
But I can buy an argument that we should not send out what is not
an important or sorely needed HTTP header. This might at this
point in time fall into that category.
The only other point I would like to pursue is that others in the
net community agree with this assessment. It won't hurt to ask
before verifying the resolution. And that is what I will do now.
Reporter | ||
Comment 24•25 years ago
|
||
I've publicly asked net people about this feature
and no one expressed concern about this feature not
being in Mozilla. The question was asked some time ago
and I now feel that we have waited long enough for
reaction.
I think the resolution should be wontfix rather than
worksforme, however.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Reporter | ||
Comment 25•25 years ago
|
||
Changing the resolution to WontFix.
Status: REOPENED → RESOLVED
Closed: 25 years ago → 25 years ago
Resolution: --- → WONTFIX
Comment 27•24 years ago
|
||
I have read through all the arguments in bug 13393 and would like to
weigh in with a few:
>>>>> From Erik van der Poel 2000-01-22 13:56 -------
ep> MSIE5 does not emit Accept-Charset, and MSIE5 has a large market
ep> share. If DSGW is interested in supporting a large fraction of
ep> Internet users, DSGW will have to make changes to their own
ep> releases and to their customers' installations.
As far as I remember, MSIE5 sends HTTP/1.1 and thus is required to
understand UTF-8. (cf. chapter 14.2. HTTP/1.1). If a browser
understands UTF-8 and everybody knows this because it is HTTP/1.1, it
can refrain from sending this header, it would be redundant. But even
in this case it would still be polite to send Accept-Charset because a
HTTP/1.0 proxy will be required to downgrade a request to HTTP/1.0 and
thus the server can't find out that the browser behind the proxy is
HTTP/1.1.
ep> Mozilla is trying to reduce the amount of stuff it sends out with
ep> EVERY HTTP request. Accept-Charset has limited value. Mozilla needs
ep> to weigh all of these factors and make a decision. It's not my
ep> decision to make, but my opinion is that Mozilla 5 should refrain
ep> from emitting Accept-Charset for the above reasons.
Erik doesn't say why. I honour the decision to send terse headers, but
it is a wrong decision to say, let's just follow IE5. As long as we do
not have the arguments on the table why they decided their way, we
must find them out ourselves.
>>>>> ------ Additional Comments From Katsuhiko Momoi 2000-01-22 14:22 -------
km> I'm reasonably sure that what you suggest are all doable. I have
km> no idea, however, how practical that is in this situation or how
km> much extra work that would entail. I hear occasionally from
km> Russian users that their sites use accept-charset sniffer. I guess
km> in languages where multiple charsets are competing, accept-charset
km> would be nice but again I don't know how sorely this is needed for
km> such a case.
I'm not speaking for languages where multiple charsets are competing,
I'm speaking from the perspective of an i18n'd server, of which I have
implemented a few. An i18n'd server typically works with Unicode
internally and converts on request. The server can be implemented in a
language-ignorant way, it sends many languages. Talking about language
here somehow muddies the waters. If Mozilla doesn't send
Accept-Charset, the server side must convert to iso-8859-1 because
this was the standard charset in HTTP/1.0. Period.
So my revised suggestion of how to form this header would be:
Accept-Charset: utf-8,*;q=0.8
and leave the primary charset out of the equation. I see no reason why
the primary charset should be announced to servers at all. Mozilla can
convert to it anyway. And if the conversion would be lossy, it would be
wise not to convert to it. But that's beyond the scope of this bugid.
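To make the effect of the suggested header concrete, here is a small Python sketch of how a server might read q-values from an Accept-Charset field. It follows my reading of RFC 2616 section 14.2 ("*" matches any charset not listed, and without "*" an unlisted charset gets q=0 except ISO-8859-1, which gets q=1). The function names are hypothetical, and the case of an entirely absent header is deliberately not handled.

```python
def parse_accept_charset(value):
    """Map each lower-cased charset (or '*') to its q-value; q defaults to 1."""
    qualities = {}
    for item in value.split(","):
        name, _, params = item.strip().partition(";")
        q = 1.0
        for param in params.split(";"):
            key, _, val = param.strip().partition("=")
            if key.strip().lower() == "q" and val.strip():
                q = float(val)
        if name.strip():
            qualities[name.strip().lower()] = q
    return qualities


def charset_quality(value, charset):
    """Return the q-value the header assigns to `charset`."""
    qualities = parse_accept_charset(value)
    charset = charset.lower()
    if charset in qualities:
        return qualities[charset]
    if "*" in qualities:
        return qualities["*"]
    # No wildcard: only ISO-8859-1 is implicitly acceptable.
    return 1.0 if charset == "iso-8859-1" else 0.0
```

For "utf-8,*;q=0.8" this gives 1.0 for utf-8 and 0.8 for anything else, which is the preference ordering the suggestion aims for.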
--
andreas
Status: VERIFIED → REOPENED
Resolution: WONTFIX → ---
Reporter | ||
Comment 28•24 years ago
|
||
Andreas, the LDAP server case I was referring to above is one example of your
i18n'ed server. It stores all the data in UTF-8. It then sends that data to a client
in UTF-8 or in an encoding appropriate for the language of the client in case
the client does not explicitly say what charset it can accept.
(The question of language does come into play for certain types of data.)
Comment 29•24 years ago
|
||
*** Bug 48361 has been marked as a duplicate of this bug. ***
Comment 30•24 years ago
|
||
"in an encoding appropriate for the language of the client" is a very vague
concept. What if the document is not in the language of the client and is not
displayable in the encoding appropriate for the language of the client? Note
that the language of the client can be a set of languages too.
Reporter | ||
Comment 31•24 years ago
|
||
Andreas, there are many different ways to make use of accept-charset.
Suppose you have a directory server deployed in a predominantly Japanese
environment. The LDAP protocol's default charset is UTF-8, so all the data would be
in UTF-8. Now if a Japanese client accesses it and says that the primary charset is
Shift_JIS but UTF-8 is OK, then the server simply sends UTF-8. If not,
it sends Shift_JIS encoded Japanese data. This kind of use is what we have in
the case described above.
Then there are the kind of cases you describe above. You may have data in many
languages on a single page which can be encoded in ISO-8859-1 or UTF-8.
The notion of primary charset is quite useful in some of these cases.
Note also that ISO-8859-1 is always assumed even if it is not explicitly listed.
Comment 32•24 years ago
|
||
Katsuhiko, I'd like to structure the things to discuss, not all of
them need to be addressed or resolved now.
1. Should Mozilla send an Accept-Charset header that contains at least
utf-8 and "*"? I believe my arguments above prove this is necessary,
and Mozilla should have it, at least for the next few years during
which the rest of the world is not utf-8 safe.
2. Should Mozilla have the notion of a primary charset? I did not
question this and I still believe it is useful for Mozilla. I see the
main usefulness when it comes to storing content on disk, but also
when it comes to browsing sites that do not declare their charset and
heuristics are needed to determine it. But this is an entirely
different problem domain, so let's not get carried away with these
problems.
3. Should Mozilla include the primary charset in the Accept-Charset
header? I see no need to. Mozilla can most probably read any charset
and this is expressed with the star. If Mozilla has no bugs in the
conversion engine, it makes no difference for the user if he gets a
LATIN SMALL LETTER C WITH CEDILLA as u+00E7 in utf-8 or as 0xE7 in
iso-8859-1. Or to try an equivalent, 0xC4 in Shift-Jis is a
HALFWIDTH KATAKANA LETTER TO, but u+FF84 is the same thing. No need to
express a preference of one over the other.
4. Does the user need to be able to configure the Accept-Charset
header? I see no reason to. Same argument as in (3) above.
5. Does Mozilla need to consider the set of languages the user has
chosen in the language preferences when sending the Accept-Charset
header? I'd say, definitely not.
Among the 5 topics, only #1 needs to be addressed.
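The encoding equivalences mentioned in point 3 can be checked with a short script, shown here in Python purely for illustration. It relies on Python's built-in `iso-8859-1`, `utf-8` and `shift_jis` codecs; the variable names are arbitrary.

```python
import unicodedata

# The same character, reached from two different encodings.
c_cedilla_latin1 = b"\xe7".decode("iso-8859-1")
c_cedilla_utf8 = b"\xc3\xa7".decode("utf-8")

# Halfwidth katakana TO is the single byte 0xC4 in Shift_JIS
# and the three bytes EF BE 84 in UTF-8.
katakana_to_sjis = b"\xc4".decode("shift_jis")
katakana_to_utf8 = b"\xef\xbe\x84".decode("utf-8")

same_cedilla = c_cedilla_latin1 == c_cedilla_utf8
same_katakana = katakana_to_sjis == katakana_to_utf8
cedilla_name = unicodedata.name(c_cedilla_latin1)
katakana_name = unicodedata.name(katakana_to_sjis)
```

Both comparisons come out equal once decoded, which is the sense in which the wire encoding makes no difference to the user, provided the converters are correct.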
Reporter | ||
Comment 33•24 years ago
|
||
My response to issues raised by Andreas:
#1: Agreed.
#2: We already have that expressed in Navigator Default charset in the Preferences. (This is the client
side preference setting and has no interactive aspect with servers.)
#3: In an ideal world, this would be true. But just like your argument in #1, i.e. the world is not UTF-8 safe
yet, not everyone would tag their Unicode documents with a lang tag indicating what language
that is. And Mozilla has a dependency on language for which font glyphs to use. For example, Unicode
CJK ideographs are not necessarily rendered the same from language to language. The same
code point may lead to different font glyphs depending on what language it is. Unless everyone
uses a lang tag, I may end up seeing a Japanese document with some Chinese glyphs. And
I definitely don't want that!
(See how fonts are set in the preference dialog -- according to language. But if language info is not
available in the docs, we do our best by looking at the charset info -- a charset is a good
secondary determining factor for some language, e.g. Chinese, Japanese, Korean, etc.. Thus, the notion
of primary charset is still useful in this situation. )
#4: The user does not have to as long as the localization process can take care of it.
#5: Agreed. But we may use the Navigator default charset for this.
Comment 34•24 years ago
|
||
Thank you for the background info for #3--very interesting, I see more light now
and agree with you.
Comment 35•24 years ago
|
||
*** Bug 60496 has been marked as a duplicate of this bug. ***
Comment 36•24 years ago
|
||
There is a patch attached to bug 60496, by the way.
Comment 38•24 years ago
|
||
Comment 39•24 years ago
|
||
Thanks a lot for the patch!
There's one purely cosmetic thing left. When the default character set chosen
via Preferences/Languages is "Unicode (UTF-8)", then the resulting
Accept-Charset header becomes:
Accept-Charset: UTF-8, utf-8; q=0.667, *; q=0.667
which seemingly is legal but redundant.
Comment 40•24 years ago
|
||
Koenig: whoops, you're right... the patch is designed to avoid the duplicate
"utf-8", but it doesn't check for case. Change line 116 of the patch from:
+ if (PL_strstr(acceptable, "utf-8") == NULL) {
to
+ if (PL_strcasestr(acceptable, "utf-8") == NULL) {
and that should do the trick.
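The difference between the two checks can be illustrated outside the patch. The following Python sketch is only an analogy to swapping `PL_strstr` for `PL_strcasestr`; `needs_utf8` is a hypothetical name and does not appear in the patch.

```python
def needs_utf8(acceptable: str) -> bool:
    """True if 'utf-8' does not yet appear in the list, ignoring case."""
    # Lower-casing first gives the case-insensitive substring search.
    return "utf-8" not in acceptable.lower()


# A case-sensitive search misses the upper-case spelling, which is how
# the duplicate entry ends up in the header.
case_sensitive_miss = "utf-8" not in "UTF-8"
```

Here `case_sensitive_miss` is true, while `needs_utf8("UTF-8")` is false, so the second "utf-8" would no longer be appended.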
Comment 41•24 years ago
|
||
Also, while the Language Preference screen won't let you do it, the above patch
will allow a comma separated list of character set/encodings in the
intl.charset.default, which you can set by manually editing your prefs.js.
Nothing else seems to use intl.charset.default (true?), but if something else
isn't expecting comma-delimited tokens in that preference, this could get you
into trouble.
Reporter | ||
Comment 42•24 years ago
|
||
intl.charset.default must be a single item entry.
(No comma delimited list should be in it -- it
defeats the purpose of this pref!)
It is your default fallback encoding for browsing
in case HTTP, HTTP Meta-Equiv, or Auto-detection
cannot give you a document charset.
For Composer, it is used as the default encoding
for a new document.
This value should be set by a localizer to be
suitable for each locale. It has a UI also:
Edit | Prefs | Navigator | Languages | Character Coding.
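As a rough illustration of the fallback role described above, here is a Python sketch. The function name and parameters are hypothetical, and the precedence among the HTTP header, Meta-Equiv and auto-detection is an assumption made for the example; only the final fallback to the default is taken from the description.

```python
def resolve_document_charset(http_charset=None, meta_charset=None,
                             detected_charset=None,
                             default_charset="ISO-8859-1"):
    """Return the first available charset, else the single-item default."""
    if "," in default_charset:
        # The default must be a single entry, not a comma-delimited list.
        raise ValueError("default charset must name a single charset")
    for candidate in (http_charset, meta_charset, detected_charset):
        if candidate:
            return candidate
    return default_charset
```

A document with no charset information from any source would then be read using the localizer-chosen default, for example Shift_JIS in a Japanese build.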
Comment 43•24 years ago
|
||
Understood. The above patch is still o.k., because while _it_ can handle a comma
delimited list, it doesn't add a comma list to the pref itself-- just a little
bit of (unneeded for now, until the patch is changed to use a preference other
than intl.charset.default) robustness-- access to intl.charset.default is
read-only.
Comment 44•24 years ago
|
||
http bugs to "Networking::HTTP"
Assignee: gagan → darin
Status: REOPENED → NEW
Component: Internationalization → Networking: HTTP
QA Contact: momoi → tever
Target Milestone: Future → M19
Comment 46•24 years ago
|
||
Comment 47•24 years ago
|
||
Assignee | ||
Comment 48•24 years ago
|
||
Looks good. r=darin
Assignee | ||
Comment 50•24 years ago
|
||
Fix checked in.
Status: NEW → RESOLVED
Closed: 25 years ago → 24 years ago
Resolution: --- → FIXED
Comment 51•24 years ago
|
||
You can check what mozilla sends at:
http://gemal.dk/browserspy/accept.cgi
Comment 52•24 years ago
|
||
Henrik Gemal wrote:
> You can check what mozilla sends at:
> http://gemal.dk/browserspy/accept.cgi
or you can use
http://www.mozilla.gr.jp:4321/
which is step B20 of the smoketests at
http://www.mozilla.org/quality/smoketests/