Closed
Bug 150373
Opened 23 years ago
Closed 23 years ago
Malformed Unicode entities are rendered forgivingly
Categories
(Core :: DOM: HTML Parser, defect)
VERIFIED
INVALID
People
(Reporter: fun, Assigned: harishd)
Details
(Keywords: testcase)
Attachments
(2 files)
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.0.0) Gecko/20020530
BuildID: 2002053012
I noticed this when testing my home page - &#8217: (note that's a colon,
not a semicolon) rendered as close-single-quote colon, rather than the
explicit ASCII characters "&#8217:" as it should have.
This forgiving rendering of broken entities is a problem when you're
trying to write a compatible page - IE 5 rendered forgivingly too, but
Links did not, for instance. (It wasn't until then that I noticed I'd
gotten it wrong.)
Reproducible: Always
Steps to Reproduce:
1. Put &#8220: in your document.
2. See how it renders.
Actual Results: ": (that first being a smart close quote, not a dumb quote)
Expected Results: &#8220:
Test case attached with demonstrations of correct Unicode entities,
and two types of malformed entities: ones ending in a colon and ones
ending in a space.
The attached is quirks mode, but it does the same in standard mode.
I'm using 1.0-final, haven't tested on the trunk.
Reporter
Comment 1•23 years ago
Here's the source code of this small test case:
<html><head><title> </title></head>
<body>
<p>Here’s a ‘test case’: &#8217:</p>
<p>Here’s a ‘test case’ &#8217</p>
<p>Here’s a ‘test case’: &#8217; (correct)</p>
<p>Here’s a “test case”: &#8221:</p>
<p>Here’s a “test case” &#8221</p>
<p>Here’s a “test case”: &#8221; (correct)</p>
<p>Here’s a –test case–: &#8211:</p>
<p>Here’s a –test case– &#8211</p>
<p>Here’s a –test case–: &#8211; (correct)</p>
</body></html>
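The three variants in this test case can also be checked outside a browser. The snippet below is only an illustration, not part of the original attachment: it feeds the colon-terminated, space-terminated, and correctly terminated forms to Python's standard-library `html.unescape`, which is documented as following the HTML 5 rules for character references. The `variants` and `decoded` names are made up for this example.

```python
import html

# The three variants from the test case: colon-terminated, space-terminated,
# and correctly terminated with a semicolon.
variants = ["&#8217:", "&#8217 ", "&#8217;"]

# html.unescape is a real stdlib function; it says nothing about what
# Mozilla's parser does internally, it is just a second decoder to compare with.
decoded = [html.unescape(v) for v in variants]

for source, result in zip(variants, decoded):
    print(repr(source), "->", repr(result))
```

All three inputs come back with the reference replaced by U+2019, so this decoder is forgiving in the same way the report describes.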
Is this only in quirks mode, or in standards-compliant mode too?
OS: Windows 98 → All
Reporter
Comment 3•23 years ago
As I said in comment #0 it's in standards and in quirks mode. I'll attach a
standards mode testcase. (4.01 strict is standards mode, isn't it?)
Reporter
Comment 4•23 years ago
Here's the HTML for this small test case:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title> </title>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
</head>
<body>
<p>Here’s a ‘test case’: &#8217:</p>
<p>Here’s a ‘test case’ &#8217</p>
<p>Here’s a ‘test case’: &#8217; (correct)</p>
<p>Here’s a “test case”: &#8221:</p>
<p>Here’s a “test case” &#8221</p>
<p>Here’s a “test case”: &#8221; (correct)</p>
<p>Here’s a –test case–: &#8211:</p>
<p>Here’s a –test case– &#8211</p>
<p>Here’s a –test case–: &#8211; (correct)</p>
</body></html>
No, it still renders that in quirks mode, because you use Transitional in your
!DOCTYPE. If you remove that, it renders in Standards Compliant mode (4.01
Strict). But the entities are still there, so it's a valid bug.
(sorry I didn't see your line about quirks / standard mode)
and, it doesn't care about the colon, really, &#8220 renders as “
Assignee: attinasi → harishd
Status: UNCONFIRMED → NEW
Component: Layout → Parser
Ever confirmed: true
QA Contact: petersen → moied
http://lxr.mozilla.org/seamonkey/source/htmlparser/src/nsHTMLEntities.cpp#180
Seems like this is definitely on purpose. What does the spec say, anyway?
Comment 7•23 years ago
Omission of REFC (reference close, normally the semicolon) is permitted by SGML
(it may be limited to SHORTTAG contexts; I don't recall exactly, but it's
applicable to HTML) when the characters of the entity reference are followed by
characters which cannot be part of the entity reference, i.e., characters which
are not SGML name characters (see ISO 8879, section 9.5.4). This is correct
behavior.
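The rule described in this comment can be sketched in a few lines of Python. This is an illustrative sketch only, not the code in nsHTMLEntities.cpp: the name `decode_lenient` and the regular expression are assumptions, and it simplifies the SGML rule by ending a numeric reference at the first character that is not a digit (or hex digit, for the `&#x` form).

```python
import re

# Hypothetical pattern: "&#" + decimal digits, or "&#x"/"&#X" + hex digits,
# followed by an OPTIONAL ";" (the REFC). Digit matching is greedy, so the
# reference ends at the first character that cannot continue the number.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));?")


def decode_lenient(text):
    """Replace numeric character references, with or without the closing ';'."""

    def replace(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid code point: leave the original text untouched.
            return match.group(0)

    return _NUMERIC_REF.sub(replace, text)


print(decode_lenient("test case&#8217: and &#8217 and &#8217;"))
```

With this reading, the colon and the space in the reporter's test case simply act as terminators and are kept in the output, while a semicolon is consumed as part of the reference.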
Reporter
Comment 8•23 years ago
Maybe SGML allows that, but does HTML?
http://www.w3.org/TR/REC-html40/charset.html#h-5.3.1 says:
Numeric character references specify the code position of a character in the
document character set. Numeric character references may take two forms:
* The syntax "&#D;", where D is a decimal number, refers to the ISO 10646
decimal character number D.
* The syntax "&#xH;" or "&#XH;", where H is a hexadecimal number, refers to
the ISO 10646 hexadecimal character number H. Hexadecimal numbers in
numeric character references are case-insensitive.
It doesn't seem to leave one with the option to just leave the ';' off. Are you
sure this is invalid? Although I'll happily file a bug on Links instead if it
is :-)
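Read literally, the two forms quoted above always end in ';'. A minimal sketch of that stricter reading, which is the behavior the reporter expected, might look like the following. `decode_strict` is a hypothetical name used only for this example, and the sketch is not a claim about what HTML 4 actually requires of a parser.

```python
import re

# Hypothetical pattern: same as the quoted syntax, "&#D;" or "&#xH;"/"&#XH;",
# with the closing ";" mandatory.
_STRICT_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_strict(text):
    """Replace only numeric character references that end in ';'."""

    def replace(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid code point: leave the original text untouched.
            return match.group(0)

    return _STRICT_REF.sub(replace, text)


print(decode_strict("&#8221: stays literal, &#8221; is decoded"))
```

Under this reading, `&#8221:` is left as the literal ASCII characters and only `&#8221;` becomes a close double quote.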
Comment 9•23 years ago
HTML 4.01 normatively refers to ISO 8879. How seriously that's to be
taken...well. But the section just above
<URL:http://www.w3.org/TR/html401/charset.html#entities> makes reference to this
possibility, which applies to both
types of entity references enumerated there (Sections 5.3.1 and 5.3.2). Since a
document containing such entity references would of course be valid, I don't see
why we shouldn't support the SGML syntax. (Which is not to say I don't appreciate
the stricter constraints in XML from a parsing standpoint, but we shouldn't try
to retrofit them onto pre-XML HTML.)