Closed Bug 36820 Opened 25 years ago Closed 25 years ago

Beginning of body text cut off (cyrillic chars in HEAD)

Categories

(Core :: Internationalization, defect, P3)

x86
Linux
defect

Tracking

()

VERIFIED INVALID

People

(Reporter: ilusha, Assigned: ftang)

References

()

Details

(Keywords: testcase)

Attachments

(2 files)

From Bugzilla Helper:
User-Agent: Mozilla/4.61 [en] (X11; I; Linux 2.2.12 i686)
BuildID: 2000041805

This is Russian text in UTF-8 encoding, apparently produced by MS Word 97. Text is shown only from some point in Chapter 2; the portion at the beginning isn't shown.

Reproducible: Always

Steps to Reproduce:
1. Load the specified URL

Actual Results: Text is shown from some point in the middle.

Expected Results: Text is shown from the beginning.

My experimentation has shown that the problem is caused by Cyrillic text in the 'CONTENT' attribute of the <META NAME="Keywords"> tag in the HTML header. (If I remove the Cyrillic text, the page is shown OK.)
Confirming bug (tested on PC/Linux, build 2000042113). There generally seems to be a problem with Cyrillic characters in the HEAD. Going to attach some testcases. Note that Navigator 4.x always displays the body text.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Extending summary, adding testcase keyword.
Keywords: testcase
Summary: Beginning of text cut of. → Beginning of body text cut off (cyrillic chars in HEAD)
updating component
Assignee: asadotzler → ftang
Component: Browser-General → Internationalization
QA Contact: jelwell → teruko
This is an invalid bug. The test cases include non-UTF-8 data in the head. The byte between <TITLE> and </TITLE> is 0xD1, which on its own is not a valid UTF-8 character. According to UTF-8 (http://www.cis.ohio-state.edu/htbin/rfc/rfc2279.html), 0xD1 is the first byte of a two-byte UTF-8 character. It should be followed by a second byte in the range 0x80 to 0xBF:

UCS-4 range (hex.)    UTF-8 octet sequence (binary)
0000 0000-0000 007F   0xxxxxxx
0000 0080-0000 07FF   110xxxxx 10xxxxxx
....

Here is the od dump of that page:

D:\>od --format=x1 s*.htm
0000000 3c 48 54 4d 4c 3e 0d 0d 0a 3c 48 45 41 44 3e 0d
0000020 0d 0a 3c 4d 45 54 41 20 48 54 54 50 2d 45 51 55
0000040 49 56 3d 22 43 6f 6e 74 65 6e 74 2d 54 79 70 65
0000060 22 20 43 4f 4e 54 45 4e 54 3d 22 74 65 78 74 2f
0000100 68 74 6d 6c 3b 20 63 68 61 72 73 65 74 3d 75 74
0000120 66 2d 38 22 3e 0d 0d 0a 3c 54 49 54 4c 45 3e d1
0000140 3c 2f 54 49 54 4c 45 3e 0d 0d 0a 3c 2f 48 45 41
0000160 44 3e 0d 0d 0a 3c 42 4f 44 59 3e 0d 0d 0a 68 65
0000200 6c 6c 6f 20 77 6f 72 6c 64 2e 0d 0d 0a 3c 2f 42
0000220 4f 44 59 3e 0d 0d 0a 3c 2f 48 54 4d 4c 3e 0d 0d
0000240 0a
0000241

D:\>od --format=c s*.htm
0000000 < H T M L > \r \r \n < H E A D > \r
0000020 \r \n < M E T A H T T P - E Q U
0000040 I V = " C o n t e n t - T y p e
0000060 " C O N T E N T = " t e x t /
0000100 h t m l ; c h a r s e t = u t
0000120 f - 8 " > \r \r \n < T I T L E > 321
0000140 < / T I T L E > \r \r \n < / H E A
0000160 D > \r \r \n < B O D Y > \r \r \n h e
0000200 l l o w o r l d . \r \r \n < / B
0000220 O D Y > \r \r \n < / H T M L > \r \r
0000240 \n

Notice there is a single 0xD1 there, which makes the converter treat the following '<' as the second byte of the UTF-8 character. The closing </TITLE> tag is therefore lost, and the whole rest of the document ends up inside the <TITLE> part.

Marking this bug invalid since the test case itself is not valid UTF-8.
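For anyone who wants to repeat the check by hand, here is a minimal Python sketch of the lead-byte rule described above. The names `expected_length` and `first_invalid_offset` are made up for illustration; this is not Mozilla's converter. It only checks the lead/trail byte structure from the RFC 2279 table and ignores other rules such as overlong forms.

```python
def expected_length(lead: int) -> int:
    """Sequence length implied by a lead byte per the RFC 2279 table, or 0 if
    the byte cannot start a sequence (a stray trail byte, 0xFE or 0xFF)."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 0


def first_invalid_offset(data: bytes):
    """Offset of the first byte that starts a malformed sequence, or None."""
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        if n == 0:
            return i
        trail = data[i + 1:i + n]
        # Every trail byte must exist and lie in 0x80..0xBF.
        if len(trail) != n - 1 or any(not 0x80 <= b <= 0xBF for b in trail):
            return i
        i += n
    return None


# The TITLE element from the test case: a lone 0xD1 followed by '<' (0x3C).
sample = b"<TITLE>\xd1</TITLE>"
print(first_invalid_offset(sample))
```

On the sample, the reported offset is that of the 0xD1 byte, because the byte after it is 0x3C ('<') rather than something in 0x80 to 0xBF.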
Status: NEW → RESOLVED
Closed: 25 years ago
Resolution: --- → INVALID
Reopening. ftang@netscape.com: I'm sorry for the invalid testcase. Please ignore it. I'm not an expert in character encodings, but since the document at the originally reported URL was generated by MS Word, I doubt that it has the same error as my testcase. I'm only a prescreener, and since I could reproduce the problem, I confirmed it to avoid unnecessary delay. Please reevaluate this bug as if I had not added any comments. I promise I won't create testcases for Internationalization bugs in the future.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Marking it as assigned, M16.
Status: REOPENED → NEW
Target Milestone: --- → M16
Status: NEW → ASSIGNED
The problem is that the file is NOT in UTF-8. I ran it through http://people.netscape.com/ftang/utf8/isutf8.html and it said it is NOT valid UTF-8.
At least line 7 is not valid UTF-8:

C:\>od --format=x1 tt
0000000 3c 4d 45 54 41 20 4e 41 4d 45 3d 22 4b 65 79 77
0000020 6f 72 64 73 22 20 43 4f 4e 54 45 4e 54 3d 22 50
0000040 68 69 6c 6f 73 68 6f 70 79 2c 20 f4 e8 eb ee f1
0000060 ee f4 e8 ff 2c 20 d0 ee e7 e0 ed ee e2 22 3e 0d
0000100 0d 0a
0000102

C:\>od -c tt
0000000 < M E T A N A M E = " K e y w
0000020 o r d s " C O N T E N T = " P
0000040 h i l o s h o p y , 364 350 353 356 361
0000060 356 364 350 377 , 320 356 347 340 355 356 342 " > \r
0000100 \r \n
0000102

Notice the first non-ASCII byte in line 7 is 0xF4. According to UTF-8 (http://www.cis.ohio-state.edu/htbin/rfc/rfc2279.html), 0xF4 (binary 11110100) is the lead byte of a 4-byte UTF-8 sequence. The remaining 3 bytes should be in the range 0x80-0xBF:

0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

However, the 3 bytes following 0xF4 are "e8 eb ee", which are not in that range.

Marking this bug as invalid. It is the page itself that is buggy, not Mozilla.
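The same conclusion can be reproduced with a strict UTF-8 decoder. Below is a small Python sketch that rebuilds line 7 from the hex dump above and asks Python's built-in decoder where it gives up. The helper name `utf8_error_offset` is made up for illustration. The last step is only a guess on my part and is not stated anywhere in this bug: it tries the bytes as windows-1251 (`cp1251`), a common legacy Cyrillic encoding, to see whether they read as Russian.

```python
# Bytes of line 7, copied from the od --format=x1 dump above.
LINE7_HEX = (
    "3c 4d 45 54 41 20 4e 41 4d 45 3d 22 4b 65 79 77 "
    "6f 72 64 73 22 20 43 4f 4e 54 45 4e 54 3d 22 50 "
    "68 69 6c 6f 73 68 6f 70 79 2c 20 f4 e8 eb ee f1 "
    "ee f4 e8 ff 2c 20 d0 ee e7 e0 ed ee e2 22 3e 0d "
    "0d 0a"
)
line7 = bytes.fromhex(LINE7_HEX)


def utf8_error_offset(data: bytes):
    """Offset where strict UTF-8 decoding first fails, or None if it succeeds."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


bad_offset = utf8_error_offset(line7)
if bad_offset is None:
    print("line 7 decodes as UTF-8")
else:
    print(f"not UTF-8: decoding fails at offset {bad_offset}, "
          f"byte 0x{line7[bad_offset]:02X}")

# ASSUMPTION (not from the bug report): the page may really be windows-1251.
legacy_guess = line7.decode("cp1251")
print(legacy_guess.strip())
```

If the windows-1251 guess holds, the page is simply mislabeled: the bytes are fine for a legacy Cyrillic code page, but the charset=utf-8 declaration tells the browser to decode them as something they are not.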
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → INVALID
Verified as Invalid.
Status: RESOLVED → VERIFIED