Closed
Bug 61269
Opened 24 years ago
Closed 23 years ago
'%' will get URL-escaped to "%25" if not followed by 2 hexadecimal digits
Categories
(Core :: Networking, defect, P3)
Core
Networking
Tracking
()
RESOLVED
FIXED
mozilla0.9.6
People
(Reporter: baldauf--2015--bugzilla.mozilla.org, Assigned: dougt)
References
()
Details
Attachments
(2 files, 1 obsolete file)
(deleted),
patch
|
darin.moz
:
superreview+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
dougt
:
review+
dougt
:
superreview+
|
Details | Diff | Splinter Review |
From Bugzilla Helper:
User-Agent: Mozilla/4.75 [en] (Win98; U)
BuildID: 200111908
Mozilla escapes '%' characters in a URL to "%25" if the '%' is not followed by
two hexadecimal digits [0-9a-fA-F]. This URL both may be contained in a <A
HREF=>-Statement or may be typed in directly into the URL bar. This escaping is
very arbitrary and should be avoided due to its arbitrarity and incompatibility
with current browsers. Just never encode '%', because '%' already is the
encoding character.
Reproducible: Always
Steps to Reproduce:
1. Go to http://justchat4.medium.net:4000/chat/test/URLTest.html
2. Click on the link on that page.
Actual Results: You will see (using mozilla):
Your data supplied is %u20AC test â0AC.
Expected Results: You should see (as with Netscape 4.7 and IE5):
Your data supplied is € test â0AC.
You will see that mozilla escapes "%u20AC" but it keeps "%e20AC", so escaping is
very arbitrary. Mozilla should not "know better" how to handle such URLs, it
just should pass them 1:1 to the server. Why? As a transition-mechanism, I have
to use %uHHHH-escapes to denote UTF16-characters not found in ASCII. Because I
cannot know wether browsers URL-charset is Latin1, UTF8 or something else (and
because URLs don't offer a reliable way to detect it), I need to use this.
Escaping with '%' but not with a hexadecimal character after that '%' is one of
the remaining namespace in URLs to be used. This mechanism is site specific and
free of namespace conflicts as the unreliable Latin1 vs UTF8 transition.
As browsers have to pass URLs as given (at least to the site which generated the
URL), I consider this a bug.
Comment 1•24 years ago
|
||
From RFC 1738 (the RFC on Uniform Resource Locators):
The character "%" is unsafe because it is used for encodings of other
characters. [...] All unsafe characters must always be encoded within
a URL.
So in cases when the % is not obviously being used to encode another character,
it itself must be encoded.
Ideally, this would be done by the person writing the URL. That is, you would
have in your page:
"%25u20AC test %25e20AC"
if what you want to get passed to the server is "%u20AC test %e20AC".
I would recommend that this be marked invalid as Mozilla's behavior is correct
per the RFC.
Reporter | ||
Comment 2•24 years ago
|
||
The character "%" is unsafe because it is used for encodings of other
characters. [...] All unsafe characters must always be encoded within
a URL.
You know that this is a recursive definition. If you follow this thinking
exactly, you would to have encode "%" to "%25", which already contains a '%',
so you would have to encode it again zo "%2525", this to "%252525" and so on,
leading to an infinitely long string. This can't be the right interpretation.
To get a non-infinitely long result, you have to interprete a "%" as an escape
character and "%25" as the literal percent sign. If you interprete this way, you
must not escape an escape sign. If you do that, escaping is impossible.
Why do I report this as bug? Because else there is _no_ clear way where theory
and practice are consistent in the "transport" of an UTF16-value to the browser
and back. When I receive "%da%bf" within an URL, the server _cannot_ know wether
this is latin-1 encoded (as browsers frequently do) or wether this is UTF-8 (as
the standard recently says). To solve the problem, at least for the case when I
have to pass UTF-16 strings from the server to the browser and back, I encode
every UTF-16 value which is not within ASCII by "%uHHHH", where H is one hex
digit. This is unambigous and clear. So when you think that '%' is an escape
character to escape other characters and "%25" is the literal percent sign, you
can be happy. My '%' escapes other characters. It's clear that escape characters
must not be escaped within the same information representation layer. But this
is exactly what mozilla does.
I may want to learn how I should transport UTF-16 characters in a namespace-safe
fashion. UTF-8 over %HH is not namespace-safe, because it could be interpreted
as latin-1. UTF-7 is not namespace-safe either. I do not want my server to make
guesswork, because guessing can always be wrong. Give me a namespace-safe,
unambiguous way to encode UTF-16 characters within URLs for all browsers.
The %uHHHH-way is the only way I know of.
Note: all URLs point only back my server, no other server are involved with
%uHHHH-URLs.
Reporter | ||
Comment 3•24 years ago
|
||
I also do not know where encoding and escaping the escape character '%' is
useful, give me an example. :o)
Comment 4•24 years ago
|
||
mozilla trys to detect if a string given to it is already escaped or not,
because escaping an already escaped string is bad and over-"un"escaping is too.
I know, there are certain cornercases where the algorithm failes, but it does
work most of the time and that is the best we can hope. There simply is no way
to do this right all the time. If you want your % characters untouched, then
escape them properly. mozilla provides an escaping mask that forces the escaping
of characters. This mask is used in cases when we know we are dealing with real
% signs. Maybe it can also be used when converting between different charsets.
I don't think the current behaviour can be changed, it would break tons of
sites. Sorry.
Comment 5•24 years ago
|
||
Marking as Invalid as per comments.
Status: UNCONFIRMED → RESOLVED
Closed: 24 years ago
Resolution: --- → INVALID
Reporter | ||
Comment 6•24 years ago
|
||
Sorry, but it won't break ton's of sites, because the behaviour of IE, Netscape
Communicator 4.7 and Opera is "Don't escape a '%', because you won't get rid of
the '%', it would appear in the resulting '%25'". You see that escaping the
escape character within the same logical level is silly. There is no single site
that would break, because that site would break when being viewed with every non
mozilla-browser, too. Trust me or disprove it. :-)
Here's a patch:
diff -Nur netwerk/base/src/nsURLHelper.cpp.orig netwerk/base/src/nsURLHelper.cpp
--- netwerk/base/src/nsURLHelper.cpp.orig Fri Dec 8 20:14:56 2000
+++ netwerk/base/src/nsURLHelper.cpp Fri Dec 8 20:51:36 2000
@@ -121,11 +121,19 @@
c2[0] = *(src+2);
unsigned char c = *src++;
- /* if the char has not to be escaped or whatever follows % is
- a valid escaped string, just copy the char */
- if (IS_OK(c) || (c == HEX_ESCAPE && !(forced) && (pc1) && (pc2) &&
- PL_strpbrk(pc1, CheckHexChars) != 0 &&
- PL_strpbrk(pc2, CheckHexChars) != 0)) {
+ /* if the char has not to be escaped or is the '%' char, just copy the
+ char. Why not escaping the escape char '%'? Because
+ (1) escaping the escape char is silly, and
+ (2) some people use the '%' to get access to some unused namespace
+ to work around the undefined URL-charset problem (RFC1630 tells
+ users to use Latin-1, while RFC2396 recommends UTF-8). To make
+ the URLs received from the server at least for that case
+ unambiguous in that the server wrote the URL without knowing
+ the charset the browser prefers, some servers write
+ %uHHHH-encoded UTF-16 chars into URLs referencing the server
+ itself. Encoding a '%' would break that short-term solution.
+ */
+ if (IS_OK(c) || (c == HEX_ESCAPE && !(forced)) {
tempBuffer[tempBufferPos++]=c;
}
else
Status: RESOLVED → UNCONFIRMED
Hardware: PC → All
Resolution: INVALID → ---
Comment 7•24 years ago
|
||
Example. I have a file with a % in its name. I would like to link to it using
the actual name of the file and let the browser handle the escaping.
If I do that with NS 4.x it breaks: "Your browser sent a request that this
server could not understand."
With Mozilla I can retrieve the file just fine. There is no reason to require
users to type escape codes into the url entry field if they want to go to such a
file.
Reporter | ||
Comment 8•24 years ago
|
||
Boris,
suppose your filename is "AbC%dEf.html" and this file is within the directory
accessible by "http://justchat4.medium.net:4000/chat/test/".
Will you refer this file with
http://justchat4.medium.net:4000/chat/test/AbC%dEf.html
or with
http://justchat4.medium.net:4000/chat/test/AbC%25dEf.html
?
Try it out, with Netscape 4.7 or Mozilla 0.6, in which cases will you succeed?
You can try this with every server and every client out there. The behaviour
will be identical, the first URL would not work, the second URL form would.
<QUOTE>
There is no reason to require users to type escape codes into the url entry
field if they want to go to such a file.
</QUOTE>
There is a reason. It's the same reason why users should not write quotation
marks '"', question marks '?', hash marks '#', exclamation marks '!', spaces ' '
, tabulators and certain other characters within their URLs. It is because all
of those characters have a special meaning, regardless of the meaning, it is
special. If the user does not want to trigger such a special meaning, she|he has
to avoid to write those characters in literal form. The user may use the URL
escape mechanism to denote such a special character in its literal meaning.
Reporter | ||
Comment 9•24 years ago
|
||
Here a cleaned version of the patch. The last patch contains a typo (missing
right parenthesis). The resulting code also will be faster.
diff -Nur netwerk/base/src/nsURLHelper.cpp.orig
netwerk/base/src/nsURLHelper.cpp
--- nsURLHelper.cpp.orig Fri Dec 8 20:14:56 2000
+++ nsURLHelper.cpp Mon Dec 11 22:58:38 2000
@@ -93,7 +93,6 @@
int i = 0;
char* hexChars = "0123456789ABCDEF";
- static const char CheckHexChars[] = "0123456789ABCDEFabcdef";
int len = PL_strlen(str);
PRBool forced = PR_FALSE;
@@ -107,25 +106,23 @@
char tempBuffer[100];
unsigned int tempBufferPos = 0;
- char c1[] = " ";
- char c2[] = " ";
- char* const pc1 = c1;
- char* const pc2 = c2;
-
for (i = 0; i < len; i++)
{
- c1[0] = *(src+1);
- if (*(src+1) == '\0')
- c2[0] = '\0';
- else
- c2[0] = *(src+2);
unsigned char c = *src++;
- /* if the char has not to be escaped or whatever follows % is
- a valid escaped string, just copy the char */
- if (IS_OK(c) || (c == HEX_ESCAPE && !(forced) && (pc1) && (pc2) &&
- PL_strpbrk(pc1, CheckHexChars) != 0 &&
- PL_strpbrk(pc2, CheckHexChars) != 0)) {
+ /* if the char has not to be escaped or is the '%' char, just copy the
+ char. Why not escaping the escape char '%'? Because
+ (1) escaping the escape char is silly, and
+ (2) some people use the '%' to get access to some unused namespace
+ to work around the undefined URL-charset problem (RFC1630 tells
+ users to use Latin-1, while RFC2396 recommends UTF-8). To make
+ the URLs received from the server at least for that case
+ unambiguous in that the server wrote the URL without knowing
+ the charset the browser prefers, some servers write
+ %uHHHH-encoded UTF-16 chars into URLs referencing the server
+ itself. Encoding a '%' would break that short-term solution.
+ */
+ if (IS_OK(c) || (c == HEX_ESCAPE && !(forced))) {
tempBuffer[tempBufferPos++]=c;
}
else
Comment 10•24 years ago
|
||
Okay Xuan, I'm beginning to see your point. The cases Boris and I were talking
about can be done using the forced-URLmask. That is the mechanism that can be
used to force the escape of the % sign. Sometimes that has to be done, there is
no way around that. There are numerous examples with files that contain % signs
that have to escape the % to preserve the filename. But that mechanism is not
disturbed by your patch, at least not when the file is acessed over the file
selection box or the web location box or similar access methods. Direct access
to an url over the urlbar has to confirm the parsing rules of an url. Sometimes
that includes the usage of the escape char by the user.
We have to go through a massive amount of related closed bug reports and check
with every case. Of special interest are mail adresses that replace the @ with %
and so on. This will take some time and mine on the project is currently very
very limited, so any help is very much appreciated.
I have confirmed the report to make it more visible.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 11•24 years ago
|
||
Comment 12•24 years ago
|
||
I've checked against some of the old known bugs and couldn't find anything
negative using the patch. cc'ing ftang for some internationalisation/filename
encoding issues. Frank, can you take a look at this from your point and do some
tests with the patch applied?
Reporter | ||
Comment 13•24 years ago
|
||
Hi, folks!
For three months, nothing has been happening.
Is there any plan|timeframe to apply the patch to the main tree? There don't
seem to be objections (anymore) so far.
Comment 14•24 years ago
|
||
I currently have no time to look after this stuff. I'm still running with the
patch for months now and haven't seen any problems. But I'm still not sure about
some of the character encoding aspects of this.
Reporter | ||
Comment 15•24 years ago
|
||
<quote>
I currently have no time to look after this stuff. I'm still running with the
patch for months now and haven't seen any problems.
</quote>
Maybe you do not see any problems because this patch makes mozilla behave like
all other browsers out there. ;-)
Is it possible to apply this patch to the nightly build trees so that we have a
greater test base?
<quote>
But I'm still not sure about some of the character encoding aspects of this.
</quote>
What are your uncertainties about? You do not feel good because you leave the
namespace of %xx (where x is not a hexadecimal character) open|undefined by
current standards? Maybe I can help solving such problems. :-)
Comment 16•24 years ago
|
||
It's simple: I can't check all of the old bugs related with url parsing, some of
them involve special netscape test servers I have no access to, mainly some
internationalisation stuff, charset encoding and conversion, something like
that.
If someone could do some quick tests that there are no regressions in that area
I would agree with you that the patch should be applied to the tree.
Comment 18•24 years ago
|
||
benc can u verify andreas suggestion? thx
Comment 19•24 years ago
|
||
I can run a test suite, but I do not think I have sufficient experience to write
a test suite.
Can someone provide some technical backup for me here?
Keywords: makingtest,
qawanted
Comment 20•23 years ago
|
||
If the code cleanup in bug 94796 becomes active this patch has to move to
xpcom/io/nsEscape.cpp
Comment 21•23 years ago
|
||
Updated•23 years ago
|
Attachment #20824 -
Attachment is obsolete: true
Comment 22•23 years ago
|
||
*** Bug 93095 has been marked as a duplicate of this bug. ***
Comment 23•23 years ago
|
||
It seems some more explanation is necessary to get a r=/sr= on this bug:
What will happen with this patch applied? We will no longer escape a % which is
not a part of an escaped octet until we are forced to do so. Currently we escape
abc%xx to abc%25xx. This is in conformance with rfc 2396, section 2.4.2:
Because the percent "%" character always has the reserved purpose of
being the escape indicator, it must be escaped as "%25" in order to
be used as data within a URI.
So everything seems fine. However it is not necessary to escape the % in this
case since there is no danger of misinterpretation. In fact there seem to be
several applications/cgis that build on not escaping a lonely %. Since there is
no danger we should drop escaping a lonely % in favour of compatibility with
other browsers/cgis. We can still handle urls that escape a lonely %, but we
will not do it ourselfs until forced with the esc_Forced flag.
Reporter | ||
Comment 24•23 years ago
|
||
<cite>
What will happen with this patch applied? We will no longer escape a % which is
not a part of an escaped octet until we are forced to do so. Currently we escape
abc%xx to abc%25xx. This is in conformance with rfc 2396, section 2.4.2:
Because the percent "%" character always has the reserved purpose of
being the escape indicator, it must be escaped as "%25" in order to
be used as data within a URI.
</cite>
I think that this is _not_ in conformance with rfc 2396, because this
interpretation is recursive. If you say that it is okay to escape an "%" to
"%25", it must be okay to escape a "%25" to "%2525" and then to "%252525" and so
on. You see that a recursive interpretation of that rule does not work out.
There is an alternative view on that problem. See two representations of a URL,
one "plaintext" and one "ciphertext":
plaintext: {
protocol: "http";
host: "bugzilla.mozilla.org";
port: null;
rel_path: "/tes%test.html"
}
ciphertext: "http://bugzilla.mozilla.org/tes%25test.html"
The rule is - in this interpretation - to represent a "%" of the plaintext by a
"%25" within ciphertext. That is - in my opinion - much clearer than a recursive
interpretation.
Thus, representing a "%" with a "%25" happens only within a transformation from
plaintext to ciphertext.
Now, if you (as a browser) encounter some HTML text
<FRAME SRC="abc/xy%u123456.html" />
what representation of URL is contained within the attribute "SRC" of tag
"FRAME"? It's ciphertext, not plaintext. Thus, it is not to be transformed into
ciphertext (because it is ciphertext already). And thus, a substitution to
something like "abc/xy%25u123456.html" must not happen.
Yet, it does happen, for a more or less good reason: People sometimes like to
use the ciphertext as plaintext. If they want to reference a folder "my files",
they write "my files" instead of "my%20files" into places where URLs are
requested, thus forming an URL which contains illegal characters (the space mark
' '). Mozilla is forgiving about this and applies the
plainttext-to-ciphertext-encoding onto such URL strings (which are, by format
definition, already ciphertext). The problem is that blindly applying such an
encoding twice results in formally correct but textual wrong URLs, which is, in
my opinion, a bug.
I hope that explanation helps to get a r=|sr=. If further explanation is needed:
just ask. :-)
Comment 25•23 years ago
|
||
The definition in rfc2396 is not really recursive because it says "be careful to
not escape a string more than once" or something similar.
Escaping (nsStdEscape) in mozilla respects this and (until not forced by
esc_Forced) will not escape an already escaped octet. Otherwise an %20 would
also become %2520, %252520, and so on.
I think we should not waste our energies on this interpretation aspect of an rfc
that has already proven to be sometimes misleading (query handling!). I think we
all agree this patch should be applied, not depending on if the current
behaviour is correct according to rfc 2396 or not.
Reporter | ||
Comment 26•23 years ago
|
||
I warmly agree. :-)
Comment 27•23 years ago
|
||
Comment on attachment 46106 [details] [diff] [review]
patch in the new nsEscape world
how about adding some of Xuan's comments from his first patch? .. or at least reference this
bug, so casual readers of the code will know why this is being done.
Attachment #46106 -
Flags: superreview+
Comment 28•23 years ago
|
||
Assignee | ||
Comment 29•23 years ago
|
||
Comment on attachment 54507 [details] [diff] [review]
same patch as before, but including a comment referencing the bug
carryover sr= from darin's review. also adding r=dougt
Attachment #54507 -
Flags: superreview+
Attachment #54507 -
Flags: review+
Assignee | ||
Comment 30•23 years ago
|
||
gagan, can you check this in?
Comment 31•23 years ago
|
||
Thanks Doug! I can check this in if you want.
Comment 32•23 years ago
|
||
Hmmmm ... unless anyone objects, I will check this in over the weekend.
Assignee | ||
Comment 33•23 years ago
|
||
sounds great. If you miss, I will check it in on monday.
Assignee: gagan → dougt
Target Milestone: mozilla1.0 → mozilla0.9.6
Comment 34•23 years ago
|
||
fix finally checked in. Wohowwww!!!
Status: NEW → RESOLVED
Closed: 24 years ago → 23 years ago
Resolution: --- → FIXED
Comment 35•22 years ago
|
||
*** Bug 194416 has been marked as a duplicate of this bug. ***
You need to log in
before you can comment on or make changes to this bug.
Description
•