Closed Bug 150373 Opened 23 years ago Closed 23 years ago

Malformed Unicode entities are rendered forgivingly

Categories

(Core :: DOM: HTML Parser, defect)

x86
All
defect
Not set
normal

Tracking

()

VERIFIED INVALID

People

(Reporter: fun, Assigned: harishd)

Details

(Keywords: testcase)

Attachments

(2 files)

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.0.0) Gecko/20020530
BuildID: 2002053012

I noticed this when testing my home page - &#8217: (note that's a colon, not a semicolon) rendered as close-single-quote colon, rather than the explicit ASCII characters "&#8217:" as it should have. This forgiving rendering of broken entities is a problem when you're trying to write a compatible page - IE 5 rendered forgivingly too, but Links did not, for instance. (It wasn't until then that I noticed I'd gotten it wrong.)

Reproducible: Always

Steps to Reproduce:
1. Put &#8220: in your document.
2. See how it renders.

Actual Results: ": (that first being a smart close quote, not a dumb quote)

Expected Results: &#8220:

Test case attached with demonstrations of correct Unicode entities, and two types of malformed entities: ones ending in a colon and ones ending in a space. The attached is quirks mode, but it does the same in standard mode. I'm using 1.0-final, haven't tested on the trunk.
Here's the source code of this small test case:

<html><head><title> </title></head>
<body>
<p>Here&#8217;s a &#8216;test case&#8217: &amp;#8217:</p>
<p>Here&#8217;s a &#8216;test case&#8217 &amp;#8217</p>
<p>Here&#8217;s a &#8216;test case&#8217;: &amp;#8217; (correct)</p>
<p>Here&#8217;s a &#8220;test case&#8221: &amp;#8221:</p>
<p>Here&#8217;s a &#8220;test case&#8221 &amp;#8221</p>
<p>Here&#8217;s a &#8220;test case&#8221;: &amp;#8221; (correct)</p>
<p>Here&#8217;s a &#8211;test case&#8211: &amp;#8211:</p>
<p>Here&#8217;s a &#8211;test case&#8211 &amp;#8211</p>
<p>Here&#8217;s a &#8211;test case&#8211;: &amp;#8211; (correct)</p>
</body></html>
Is this only in quirks mode, or in standards-compliant mode too?
OS: Windows 98 → All
As I said in comment #0 it's in standards and in quirks mode. I'll attach a standards mode testcase. (4.01 strict is standards mode, isn't it?)
Here's the HTML for this small test case:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title> </title>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
</head>
<body>
<p>Here&#8217;s a &#8216;test case&#8217: &amp;#8217:</p>
<p>Here&#8217;s a &#8216;test case&#8217 &amp;#8217</p>
<p>Here&#8217;s a &#8216;test case&#8217;: &amp;#8217; (correct)</p>
<p>Here&#8217;s a &#8220;test case&#8221: &amp;#8221:</p>
<p>Here&#8217;s a &#8220;test case&#8221 &amp;#8221</p>
<p>Here&#8217;s a &#8220;test case&#8221;: &amp;#8221; (correct)</p>
<p>Here&#8217;s a &#8211;test case&#8211: &amp;#8211:</p>
<p>Here&#8217;s a &#8211;test case&#8211 &amp;#8211</p>
<p>Here&#8217;s a &#8211;test case&#8211;: &amp;#8211; (correct)</p>
</body></html>
No, it still renders that in quirks mode, because you use Transitional in your !DOCTYPE. If you remove that, it renders in standards-compliant mode (4.01 Strict). But the entities are still rendered forgivingly there, so it's a valid bug. (Sorry, I didn't see your line about quirks / standards mode.) And it doesn't really care about the colon: &#8220 without the semicolon renders the same as &#8220;
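For anyone who wants to check the "it doesn't care about the colon" observation outside a browser, here is a small sketch using Python's standard `html.unescape`. Note the assumption: that function follows the later HTML5 character reference rules, not Gecko 1.0's parser, so it only illustrates the same forgiving behavior and says nothing about the Mozilla code itself.

```python
import html

# The first two samples omit the ";" after the digits; the third is well-formed.
samples = ["&#8220:", "&#8220 ", "&#8220;"]

# html.unescape treats the trailing ";" of a numeric reference as optional.
decoded = [html.unescape(s) for s in samples]

for s, d in zip(samples, decoded):
    print(repr(s), "->", repr(d))
```

In all three cases the reference decodes to U+201C, and whatever character followed the digits (colon or space) is left in place.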
Assignee: attinasi → harishd
Status: UNCONFIRMED → NEW
Component: Layout → Parser
Ever confirmed: true
QA Contact: petersen → moied
http://lxr.mozilla.org/seamonkey/source/htmlparser/src/nsHTMLEntities.cpp#180 Seems like this is definitely on purpose. What does the spec say, anyway?
Omission of REFC (reference close, normally the semicolon) is permitted by SGML (it may be limited to SHORTTAG contexts; I don't recall exactly, but it's applicable to HTML) when the characters of the entity reference are followed by characters which cannot be part of the entity reference, i.e., characters which are not SGML name characters (see ISO 8879, section 9.5.4). This is correct behavior.
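A minimal sketch of the rule described above, restricted to numeric character references: the reference ends either at an explicit ";" or at the first character that cannot continue the number. The name `decode_lenient` and the regular expression are hypothetical, written for illustration only; this is not the code in nsHTMLEntities.cpp, and it does not attempt to model the rest of ISO 8879.

```python
import re

# Hypothetical pattern: "&#" + decimal digits, or "&#x"/"&#X" + hex digits,
# followed by an OPTIONAL ";" (the REFC delimiter).
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));?")

def decode_lenient(text: str) -> str:
    """Decode numeric character references, with or without the closing ';'."""
    def _sub(match: re.Match) -> str:
        hex_digits, dec_digits = match.group(1), match.group(2)
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Not a valid code point: leave the original text untouched.
            return match.group(0)
    return _NUMERIC_REF.sub(_sub, text)

print(decode_lenient("test case&#8217: and test case&#8217;:"))
```

Because the digit run is matched greedily, the colon in "&#8217:" is simply the first character that cannot belong to the reference, so it ends the reference and is kept as ordinary text.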
Status: NEW → RESOLVED
Closed: 23 years ago
Keywords: testcase
Resolution: --- → INVALID
Maybe SGML allows that, but does HTML? http://www.w3.org/TR/REC-html40/charset.html#h-5.3.1 says:

Numeric character references specify the code position of a character in the document character set. Numeric character references may take two forms:

* The syntax "&#D;", where D is a decimal number, refers to the ISO 10646 decimal character number D.
* The syntax "&#xH;" or "&#XH;", where H is a hexadecimal number, refers to the ISO 10646 hexadecimal character number H. Hexadecimal numbers in numeric character references are case-insensitive.

It doesn't seem to leave one with the option to just leave the ';' off. Are you sure this is invalid? Although I'll happily file a bug on links instead if it is :-)
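To make the strict reading of that passage concrete, here is a sketch of a decoder that accepts only the two quoted forms, "&#D;" and "&#xH;"/"&#XH;", and leaves anything without the ";" as literal text. The name `decode_strict` is hypothetical and the code is an illustration of this reading of the spec text, not of how any particular browser behaves.

```python
import re

# Hypothetical pattern: same as the two quoted forms, with the ";" REQUIRED.
_STRICT_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")

def decode_strict(text: str) -> str:
    """Decode only numeric character references that end in ';'."""
    def _sub(match: re.Match) -> str:
        hex_digits, dec_digits = match.group(1), match.group(2)
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Not a valid code point: leave the original text untouched.
            return match.group(0)
    return _STRICT_REF.sub(_sub, text)

print(decode_strict("test case&#8217: and test case&#8217;:"))
```

Under this reading "&#8217:" stays as the seven ASCII characters, which matches the Expected Results in comment #0.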
HTML 4.01 normatively refers to ISO 8879. How seriously that's to be taken...well. But the section just above <URL:http://www.w3.org/TR/html401/charset.html#entities> makes reference to this possibility, which applies to both types of entity references enumerated there (Sections 5.3.1 and 5.3.2). Since a document containing such entity references would of course be valid, I don't see why we shouldn't support the SGML syntax. (Which is not to say I don't appreciate the stricter constraints in XML from a parsing standpoint, but we shouldn't try to retrofit them onto pre-XML HTML.)
Verified Invalid
Status: RESOLVED → VERIFIED