Closed Bug 61269 Opened 24 years ago Closed 23 years ago

'%' will get URL-escaped to "%25" if not followed by 2 hexadecimal digits

Categories

(Core :: Networking, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla0.9.6

People

(Reporter: baldauf--2015--bugzilla.mozilla.org, Assigned: dougt)

References

()

Details

Attachments

(2 files, 1 obsolete file)

From Bugzilla Helper: User-Agent: Mozilla/4.75 [en] (Win98; U) BuildID: 200111908 Mozilla escapes '%' characters in a URL to "%25" if the '%' is not followed by two hexadecimal digits [0-9a-fA-F]. This URL both may be contained in a <A HREF=>-Statement or may be typed in directly into the URL bar. This escaping is very arbitrary and should be avoided due to its arbitrarity and incompatibility with current browsers. Just never encode '%', because '%' already is the encoding character. Reproducible: Always Steps to Reproduce: 1. Go to http://justchat4.medium.net:4000/chat/test/URLTest.html 2. Click on the link on that page. Actual Results: You will see (using mozilla): Your data supplied is %u20AC test â0AC. Expected Results: You should see (as with Netscape 4.7 and IE5): Your data supplied is € test â0AC. You will see that mozilla escapes "%u20AC" but it keeps "%e20AC", so escaping is very arbitrary. Mozilla should not "know better" how to handle such URLs, it just should pass them 1:1 to the server. Why? As a transition-mechanism, I have to use %uHHHH-escapes to denote UTF16-characters not found in ASCII. Because I cannot know wether browsers URL-charset is Latin1, UTF8 or something else (and because URLs don't offer a reliable way to detect it), I need to use this. Escaping with '%' but not with a hexadecimal character after that '%' is one of the remaining namespace in URLs to be used. This mechanism is site specific and free of namespace conflicts as the unreliable Latin1 vs UTF8 transition. As browsers have to pass URLs as given (at least to the site which generated the URL), I consider this a bug.
From RFC 1738 (the RFC on Uniform Resource Locators): The character "%" is unsafe because it is used for encodings of other characters. [...] All unsafe characters must always be encoded within a URL. So in cases when the % is not obviously being used to encode another character, it itself must be encoded. Ideally, this would be done by the person writing the URL. That is, you would have in your page: "%25u20AC test %25e20AC" if what you want to get passed to the server is "%u20AC test %e20AC". I would recommend that this be marked invalid as Mozilla's behavior is correct per the RFC.
The character "%" is unsafe because it is used for encodings of other characters. [...] All unsafe characters must always be encoded within a URL. You know that this is a recursive definition. If you follow this thinking exactly, you would to have encode "%" to "%25", which already contains a '%', so you would have to encode it again zo "%2525", this to "%252525" and so on, leading to an infinitely long string. This can't be the right interpretation. To get a non-infinitely long result, you have to interprete a "%" as an escape character and "%25" as the literal percent sign. If you interprete this way, you must not escape an escape sign. If you do that, escaping is impossible. Why do I report this as bug? Because else there is _no_ clear way where theory and practice are consistent in the "transport" of an UTF16-value to the browser and back. When I receive "%da%bf" within an URL, the server _cannot_ know wether this is latin-1 encoded (as browsers frequently do) or wether this is UTF-8 (as the standard recently says). To solve the problem, at least for the case when I have to pass UTF-16 strings from the server to the browser and back, I encode every UTF-16 value which is not within ASCII by "%uHHHH", where H is one hex digit. This is unambigous and clear. So when you think that '%' is an escape character to escape other characters and "%25" is the literal percent sign, you can be happy. My '%' escapes other characters. It's clear that escape characters must not be escaped within the same information representation layer. But this is exactly what mozilla does. I may want to learn how I should transport UTF-16 characters in a namespace-safe fashion. UTF-8 over %HH is not namespace-safe, because it could be interpreted as latin-1. UTF-7 is not namespace-safe either. I do not want my server to make guesswork, because guessing can always be wrong. Give me a namespace-safe, unambiguous way to encode UTF-16 characters within URLs for all browsers. The %uHHHH-way is the only way I know of. Note: all URLs point only back my server, no other server are involved with %uHHHH-URLs.
I also do not know where encoding and escaping the escape character '%' is useful, give me an example. :o)
mozilla trys to detect if a string given to it is already escaped or not, because escaping an already escaped string is bad and over-"un"escaping is too. I know, there are certain cornercases where the algorithm failes, but it does work most of the time and that is the best we can hope. There simply is no way to do this right all the time. If you want your % characters untouched, then escape them properly. mozilla provides an escaping mask that forces the escaping of characters. This mask is used in cases when we know we are dealing with real % signs. Maybe it can also be used when converting between different charsets. I don't think the current behaviour can be changed, it would break tons of sites. Sorry.
Marking as Invalid as per comments.
Status: UNCONFIRMED → RESOLVED
Closed: 24 years ago
Resolution: --- → INVALID
Sorry, but it won't break ton's of sites, because the behaviour of IE, Netscape Communicator 4.7 and Opera is "Don't escape a '%', because you won't get rid of the '%', it would appear in the resulting '%25'". You see that escaping the escape character within the same logical level is silly. There is no single site that would break, because that site would break when being viewed with every non mozilla-browser, too. Trust me or disprove it. :-) Here's a patch: diff -Nur netwerk/base/src/nsURLHelper.cpp.orig netwerk/base/src/nsURLHelper.cpp --- netwerk/base/src/nsURLHelper.cpp.orig Fri Dec 8 20:14:56 2000 +++ netwerk/base/src/nsURLHelper.cpp Fri Dec 8 20:51:36 2000 @@ -121,11 +121,19 @@ c2[0] = *(src+2); unsigned char c = *src++; - /* if the char has not to be escaped or whatever follows % is - a valid escaped string, just copy the char */ - if (IS_OK(c) || (c == HEX_ESCAPE && !(forced) && (pc1) && (pc2) && - PL_strpbrk(pc1, CheckHexChars) != 0 && - PL_strpbrk(pc2, CheckHexChars) != 0)) { + /* if the char has not to be escaped or is the '%' char, just copy the + char. Why not escaping the escape char '%'? Because + (1) escaping the escape char is silly, and + (2) some people use the '%' to get access to some unused namespace + to work around the undefined URL-charset problem (RFC1630 tells + users to use Latin-1, while RFC2396 recommends UTF-8). To make + the URLs received from the server at least for that case + unambiguous in that the server wrote the URL without knowing + the charset the browser prefers, some servers write + %uHHHH-encoded UTF-16 chars into URLs referencing the server + itself. Encoding a '%' would break that short-term solution. + */ + if (IS_OK(c) || (c == HEX_ESCAPE && !(forced)) { tempBuffer[tempBufferPos++]=c; } else
Status: RESOLVED → UNCONFIRMED
Hardware: PC → All
Resolution: INVALID → ---
Example. I have a file with a % in its name. I would like to link to it using the actual name of the file and let the browser handle the escaping. If I do that with NS 4.x it breaks: "Your browser sent a request that this server could not understand." With Mozilla I can retrieve the file just fine. There is no reason to require users to type escape codes into the url entry field if they want to go to such a file.
Boris, suppose your filename is "AbC%dEf.html" and this file is within the directory accessible by "http://justchat4.medium.net:4000/chat/test/". Will you refer this file with http://justchat4.medium.net:4000/chat/test/AbC%dEf.html or with http://justchat4.medium.net:4000/chat/test/AbC%25dEf.html ? Try it out, with Netscape 4.7 or Mozilla 0.6, in which cases will you succeed? You can try this with every server and every client out there. The behaviour will be identical, the first URL would not work, the second URL form would. <QUOTE> There is no reason to require users to type escape codes into the url entry field if they want to go to such a file. </QUOTE> There is a reason. It's the same reason why users should not write quotation marks '"', question marks '?', hash marks '#', exclamation marks '!', spaces ' ' , tabulators and certain other characters within their URLs. It is because all of those characters have a special meaning, regardless of the meaning, it is special. If the user does not want to trigger such a special meaning, she|he has to avoid to write those characters in literal form. The user may use the URL escape mechanism to denote such a special character in its literal meaning.
Here a cleaned version of the patch. The last patch contains a typo (missing right parenthesis). The resulting code also will be faster. diff -Nur netwerk/base/src/nsURLHelper.cpp.orig netwerk/base/src/nsURLHelper.cpp --- nsURLHelper.cpp.orig Fri Dec 8 20:14:56 2000 +++ nsURLHelper.cpp Mon Dec 11 22:58:38 2000 @@ -93,7 +93,6 @@ int i = 0; char* hexChars = "0123456789ABCDEF"; - static const char CheckHexChars[] = "0123456789ABCDEFabcdef"; int len = PL_strlen(str); PRBool forced = PR_FALSE; @@ -107,25 +106,23 @@ char tempBuffer[100]; unsigned int tempBufferPos = 0; - char c1[] = " "; - char c2[] = " "; - char* const pc1 = c1; - char* const pc2 = c2; - for (i = 0; i < len; i++) { - c1[0] = *(src+1); - if (*(src+1) == '\0') - c2[0] = '\0'; - else - c2[0] = *(src+2); unsigned char c = *src++; - /* if the char has not to be escaped or whatever follows % is - a valid escaped string, just copy the char */ - if (IS_OK(c) || (c == HEX_ESCAPE && !(forced) && (pc1) && (pc2) && - PL_strpbrk(pc1, CheckHexChars) != 0 && - PL_strpbrk(pc2, CheckHexChars) != 0)) { + /* if the char has not to be escaped or is the '%' char, just copy the + char. Why not escaping the escape char '%'? Because + (1) escaping the escape char is silly, and + (2) some people use the '%' to get access to some unused namespace + to work around the undefined URL-charset problem (RFC1630 tells + users to use Latin-1, while RFC2396 recommends UTF-8). To make + the URLs received from the server at least for that case + unambiguous in that the server wrote the URL without knowing + the charset the browser prefers, some servers write + %uHHHH-encoded UTF-16 chars into URLs referencing the server + itself. Encoding a '%' would break that short-term solution. + */ + if (IS_OK(c) || (c == HEX_ESCAPE && !(forced))) { tempBuffer[tempBufferPos++]=c; } else
Okay Xuan, I'm beginning to see your point. The cases Boris and I were talking about can be done using the forced-URLmask. That is the mechanism that can be used to force the escape of the % sign. Sometimes that has to be done, there is no way around that. There are numerous examples with files that contain % signs that have to escape the % to preserve the filename. But that mechanism is not disturbed by your patch, at least not when the file is acessed over the file selection box or the web location box or similar access methods. Direct access to an url over the urlbar has to confirm the parsing rules of an url. Sometimes that includes the usage of the escape char by the user. We have to go through a massive amount of related closed bug reports and check with every case. Of special interest are mail adresses that replace the @ with % and so on. This will take some time and mine on the project is currently very very limited, so any help is very much appreciated. I have confirmed the report to make it more visible.
Status: UNCONFIRMED → NEW
Ever confirmed: true
I've checked against some of the old known bugs and couldn't find anything negative using the patch. cc'ing ftang for some internationalisation/filename encoding issues. Frank, can you take a look at this from your point and do some tests with the patch applied?
Hi, folks! For three months, nothing has been happening. Is there any plan|timeframe to apply the patch to the main tree? There don't seem to be objections (anymore) so far.
I currently have no time to look after this stuff. I'm still running with the patch for months now and haven't seen any problems. But I'm still not sure about some of the character encoding aspects of this.
<quote> I currently have no time to look after this stuff. I'm still running with the patch for months now and haven't seen any problems. </quote> Maybe you do not see any problems because this patch makes mozilla behave like all other browsers out there. ;-) Is it possible to apply this patch to the nightly build trees so that we have a greater test base? <quote> But I'm still not sure about some of the character encoding aspects of this. </quote> What are your uncertainties about? You do not feel good because you leave the namespace of %xx (where x is not a hexadecimal character) open|undefined by current standards? Maybe I can help solving such problems. :-)
It's simple: I can't check all of the old bugs related with url parsing, some of them involve special netscape test servers I have no access to, mainly some internationalisation stuff, charset encoding and conversion, something like that. If someone could do some quick tests that there are no regressions in that area I would agree with you that the patch should be applied to the tree.
mass move, v2. qa to me.
QA Contact: tever → benc
benc can u verify andreas suggestion? thx
I can run a test suite, but I do not think I have sufficient experience to write a test suite. Can someone provide some technical backup for me here?
Target Milestone: --- → mozilla1.0
Keywords: makingtest, qawanted
Blocks: 35209
If the code cleanup in bug 94796 becomes active this patch has to move to xpcom/io/nsEscape.cpp
Attached patch patch in the new nsEscape world (deleted) — Splinter Review
Attachment #20824 - Attachment is obsolete: true
*** Bug 93095 has been marked as a duplicate of this bug. ***
Keywords: patch
It seems some more explanation is necessary to get a r=/sr= on this bug: What will happen with this patch applied? We will no longer escape a % which is not a part of an escaped octet until we are forced to do so. Currently we escape abc%xx to abc%25xx. This is in conformance with rfc 2396, section 2.4.2: Because the percent "%" character always has the reserved purpose of being the escape indicator, it must be escaped as "%25" in order to be used as data within a URI. So everything seems fine. However it is not necessary to escape the % in this case since there is no danger of misinterpretation. In fact there seem to be several applications/cgis that build on not escaping a lonely %. Since there is no danger we should drop escaping a lonely % in favour of compatibility with other browsers/cgis. We can still handle urls that escape a lonely %, but we will not do it ourselfs until forced with the esc_Forced flag.
<cite> What will happen with this patch applied? We will no longer escape a % which is not a part of an escaped octet until we are forced to do so. Currently we escape abc%xx to abc%25xx. This is in conformance with rfc 2396, section 2.4.2: Because the percent "%" character always has the reserved purpose of being the escape indicator, it must be escaped as "%25" in order to be used as data within a URI. </cite> I think that this is _not_ in conformance with rfc 2396, because this interpretation is recursive. If you say that it is okay to escape an "%" to "%25", it must be okay to escape a "%25" to "%2525" and then to "%252525" and so on. You see that a recursive interpretation of that rule does not work out. There is an alternative view on that problem. See two representations of a URL, one "plaintext" and one "ciphertext": plaintext: { protocol: "http"; host: "bugzilla.mozilla.org"; port: null; rel_path: "/tes%test.html" } ciphertext: "http://bugzilla.mozilla.org/tes%25test.html" The rule is - in this interpretation - to represent a "%" of the plaintext by a "%25" within ciphertext. That is - in my opinion - much clearer than a recursive interpretation. Thus, representing a "%" with a "%25" happens only within a transformation from plaintext to ciphertext. Now, if you (as a browser) encounter some HTML text <FRAME SRC="abc/xy%u123456.html" /> what representation of URL is contained within the attribute "SRC" of tag "FRAME"? It's ciphertext, not plaintext. Thus, it is not to be transformed into ciphertext (because it is ciphertext already). And thus, a substitution to something like "abc/xy%25u123456.html" must not happen. Yet, it does happen, for a more or less good reason: People sometimes like to use the ciphertext as plaintext. If they want to reference a folder "my files", they write "my files" instead of "my%20files" into places where URLs are requested, thus forming an URL which contains illegal characters (the space mark ' '). Mozilla is forgiving about this and applies the plainttext-to-ciphertext-encoding onto such URL strings (which are, by format definition, already ciphertext). The problem is that blindly applying such an encoding twice results in formally correct but textual wrong URLs, which is, in my opinion, a bug. I hope that explanation helps to get a r=|sr=. If further explanation is needed: just ask. :-)
The definition in rfc2396 is not really recursive because it says "be careful to not escape a string more than once" or something similar. Escaping (nsStdEscape) in mozilla respects this and (until not forced by esc_Forced) will not escape an already escaped octet. Otherwise an %20 would also become %2520, %252520, and so on. I think we should not waste our energies on this interpretation aspect of an rfc that has already proven to be sometimes misleading (query handling!). I think we all agree this patch should be applied, not depending on if the current behaviour is correct according to rfc 2396 or not.
I warmly agree. :-)
Comment on attachment 46106 [details] [diff] [review] patch in the new nsEscape world how about adding some of Xuan's comments from his first patch? .. or at least reference this bug, so casual readers of the code will know why this is being done.
Attachment #46106 - Flags: superreview+
Comment on attachment 54507 [details] [diff] [review] same patch as before, but including a comment referencing the bug carryover sr= from darin's review. also adding r=dougt
Attachment #54507 - Flags: superreview+
Attachment #54507 - Flags: review+
gagan, can you check this in?
Thanks Doug! I can check this in if you want.
Hmmmm ... unless anyone objects, I will check this in over the weekend.
sounds great. If you miss, I will check it in on monday.
Assignee: gagan → dougt
Target Milestone: mozilla1.0 → mozilla0.9.6
fix finally checked in. Wohowwww!!!
Status: NEW → RESOLVED
Closed: 24 years ago23 years ago
Resolution: --- → FIXED
*** Bug 194416 has been marked as a duplicate of this bug. ***
Keywords: qawanted
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: