Closed
Bug 36820
Opened 25 years ago
Closed 25 years ago
Beginning of body text cut off (cyrillic chars in HEAD)
Categories
(Core :: Internationalization, defect, P3)
Tracking
()
VERIFIED
INVALID
M16
People
(Reporter: ilusha, Assigned: ftang)
References
()
Details
(Keywords: testcase)
Attachments
(2 files)
From Bugzilla Helper:
User-Agent: Mozilla/4.61 [en] (X11; I; Linux 2.2.12 i686)
BuildID: 2000041805
This is Russian text in UTF-8 encoding, apparently produced by MS-Word 97.
Text is shown only starting at some point in Chapter 2.
The portion of text at the beginning isn't shown.
Reproducible: Always
Steps to Reproduce:
1. Load the specified URL
Actual Results: Text is shown from some point in the middle
Expected Results: Show text from the beginning
This is Russian text in UTF-8 encoding, apparently produced by
MS-Word 97.
My experimentation has shown that the problem is caused by
Cyrillic text in the 'CONTENT' attribute of the tag
<META NAME="Keywords">
in the HTML header. (If I remove the Cyrillic text, the page is shown OK.)
Comment 1•25 years ago
Confirming bug (tested on PC/Linux, build 2000042113). There generally seems to be
a problem with Cyrillic characters in the HEAD. Going to attach some testcases.
Note that Navigator 4.x always displays the body text.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 2•25 years ago
Comment 3•25 years ago
Comment 4•25 years ago
Extending summary, adding testcase keyword.
Keywords: testcase
Summary: Beginning of text cut of. → Beginning of body text cut off (cyrillic chars in HEAD)
Comment 5•25 years ago
Updating component.
Assignee: asadotzler → ftang
Component: Browser-General → Internationalization
QA Contact: jelwell → teruko
Comment 6•25 years ago
This is an invalid bug. The test cases include non-UTF-8 bytes in the head. The byte
between <TITLE> and </TITLE> is 0xD1, which on its own is not a valid UTF-8 character.
According to the UTF-8 specification, http://www.cis.ohio-state.edu/htbin/rfc/rfc2279.html, 0xD1
is the first byte of a two-byte UTF-8 character. It should be followed by a second
byte in the range 0x80 to 0xBF.
UCS-4 range (hex.)     UTF-8 octet sequence (binary)
0000 0000-0000 007F    0xxxxxxx
0000 0080-0000 07FF    110xxxxx 10xxxxxx
....
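The lead-byte rule in the table can be sketched in a few lines of Python. This is only an illustration of the bit patterns in RFC 2279 (which allows sequences of up to six bytes); the function names `sequence_length` and `is_continuation` are made up for this example and are not taken from any Mozilla code.

```python
def sequence_length(lead: int) -> int:
    """Return the sequence length announced by a lead byte per RFC 2279,
    or 0 if the byte cannot start a sequence (continuation byte, 0xFE, 0xFF)."""
    if lead < 0x80:
        return 1  # 0xxxxxxx
    if lead < 0xC0:
        return 0  # 10xxxxxx is a continuation byte, never a lead byte
    if lead < 0xE0:
        return 2  # 110xxxxx
    if lead < 0xF0:
        return 3  # 1110xxxx
    if lead < 0xF8:
        return 4  # 11110xxx
    if lead < 0xFC:
        return 5  # 111110xx
    if lead < 0xFE:
        return 6  # 1111110x
    return 0      # 0xFE and 0xFF never appear in UTF-8


def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx (0x80 to 0xBF)."""
    return 0x80 <= byte <= 0xBF


print(sequence_length(0xD1))  # 2
print(is_continuation(0x3C))  # False: '<' cannot follow 0xD1
```

So 0xD1 announces a two-byte sequence, and the '<' (0x3C) that follows it in the test case is not an acceptable second byte.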
Here is the od dump of that page:
D:\>od --format=x1 s*.htm
0000000 3c 48 54 4d 4c 3e 0d 0d 0a 3c 48 45 41 44 3e 0d
0000020 0d 0a 3c 4d 45 54 41 20 48 54 54 50 2d 45 51 55
0000040 49 56 3d 22 43 6f 6e 74 65 6e 74 2d 54 79 70 65
0000060 22 20 43 4f 4e 54 45 4e 54 3d 22 74 65 78 74 2f
0000100 68 74 6d 6c 3b 20 63 68 61 72 73 65 74 3d 75 74
0000120 66 2d 38 22 3e 0d 0d 0a 3c 54 49 54 4c 45 3e d1
0000140 3c 2f 54 49 54 4c 45 3e 0d 0d 0a 3c 2f 48 45 41
0000160 44 3e 0d 0d 0a 3c 42 4f 44 59 3e 0d 0d 0a 68 65
0000200 6c 6c 6f 20 77 6f 72 6c 64 2e 0d 0d 0a 3c 2f 42
0000220 4f 44 59 3e 0d 0d 0a 3c 2f 48 54 4d 4c 3e 0d 0d
0000240 0a
0000241
D:\>od --format=c s*.htm
0000000 < H T M L > \r \r \n < H E A D > \r
0000020 \r \n < M E T A H T T P - E Q U
0000040 I V = " C o n t e n t - T y p e
0000060 " C O N T E N T = " t e x t /
0000100 h t m l ; c h a r s e t = u t
0000120 f - 8 " > \r \r \n < T I T L E > 321
0000140 < / T I T L E > \r \r \n < / H E A
0000160 D > \r \r \n < B O D Y > \r \r \n h e
0000200 l l o w o r l d . \r \r \n < / B
0000220 O D Y > \r \r \n < / H T M L > \r \r
0000240 \n
Notice that there is a single 0xd1 there, which makes the converter treat the following '<' as
the second byte of the UTF-8 character. The </TITLE> tag is therefore never seen, and the whole
document ends up inside the <TITLE> element.
Marking this bug invalid since the test case itself is not valid UTF-8.
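The effect can be reproduced with a deliberately naive decoder. The sketch below is hypothetical: `naive_decode` is written for this illustration and is not Mozilla's converter. It assumes only that the converter trusts the lead byte and consumes the next byte without checking that it lies in 0x80-0xBF. `PAGE` is a shortened version of the test case bytes.

```python
# Shortened version of the test case: a lone 0xD1 inside <TITLE>.
PAGE = b"<TITLE>\xd1</TITLE>\r\n</HEAD>\r\n<BODY>\r\nhello world.\r\n</BODY>"


def naive_decode(data: bytes) -> str:
    """Decode 1- and 2-byte sequences, trusting the lead byte blindly."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0xC0 <= b <= 0xDF and i + 1 < len(data):
            # Flaw: data[i + 1] is never checked against 0x80..0xBF.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            out.append(chr(b))
            i += 1
    return "".join(out)


text = naive_decode(PAGE)
print("</TITLE>" in text)  # False: the '<' was consumed as a second byte

# A strict decoder rejects the same bytes instead of guessing.
try:
    PAGE.decode("utf-8")
except UnicodeDecodeError as err:
    print("strict decoder rejects the page:", err.reason)
```

With the closing tag gone, an HTML parser would keep reading title text up to the end of the document, which matches the missing body text.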
Status: NEW → RESOLVED
Closed: 25 years ago
Resolution: --- → INVALID
Comment 7•25 years ago
Reopening.
ftang@netscape.com: I'm sorry for the invalid testcase. Please ignore it.
I'm not an expert in character encodings, but since the document at the
originally reported URL was generated by MS Word, I doubt that it has the
same error as my testcase.
I'm only a prescreener, and since I could reproduce the problem, I confirmed
it to avoid unnecessary delay.
Please reevaluate this bug as if I had not added any comments.
I promise I won't create testcases for Internationalization bugs
in the future.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Comment 8•25 years ago
Marking as assigned, M16.
Status: REOPENED → NEW
Target Milestone: --- → M16
Updated•25 years ago
Status: NEW → ASSIGNED
Comment 9•25 years ago
The problem is that the file is NOT in UTF-8. I ran it through
http://people.netscape.com/ftang/utf8/isutf8.html and it reported that the file is NOT valid
UTF-8.
Comment 10•25 years ago
At least line 7 is not valid UTF-8:
C:\>od --format=x1 tt
0000000 3c 4d 45 54 41 20 4e 41 4d 45 3d 22 4b 65 79 77
0000020 6f 72 64 73 22 20 43 4f 4e 54 45 4e 54 3d 22 50
0000040 68 69 6c 6f 73 68 6f 70 79 2c 20 f4 e8 eb ee f1
0000060 ee f4 e8 ff 2c 20 d0 ee e7 e0 ed ee e2 22 3e 0d
0000100 0d 0a
0000102
C:\>od -c tt
0000000 < M E T A N A M E = " K e y w
0000020 o r d s " C O N T E N T = " P
0000040 h i l o s h o p y , 364 350 353 356 361
0000060 356 364 350 377 , 320 356 347 340 355 356 342 " > \r
0000100 \r \n
0000102
Notice that the first non-ASCII byte in line 7 is 0xF4.
According to UTF-8 (http://www.cis.ohio-state.edu/htbin/rfc/rfc2279.html), 0xF4 (binary 11110100)
is the lead byte of a four-byte UTF-8 sequence. The remaining three bytes should be in the range
0x80-0xBF:
0001 0000-001F FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
However, the three bytes following 0xf4 are "e8 eb ee", which are not in that range.
Marking this bug as invalid. The page itself is buggy, not Mozilla.
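For reference, the same check can be repeated with a strict decoder. The sketch below uses the non-ASCII bytes from the od dump of line 7. The helper name `is_valid_utf8` is made up for this example. The second step is an assumption that is not stated anywhere in this report: it guesses that the bytes are in a single-byte Cyrillic encoding, windows-1251, and tries decoding them that way.

```python
# CONTENT bytes after "Philoshopy, " from the od dump of line 7.
KEYWORDS = bytes.fromhex("f4 e8 eb ee f1 ee f4 e8 ff 2c 20 d0 ee e7 e0 ed ee e2")


def is_valid_utf8(data: bytes) -> bool:
    """Return True only if the bytes decode under a strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(KEYWORDS))  # False

# Assumption: the text is really windows-1251 despite the page's utf-8 label.
print(KEYWORDS.decode("cp1251"))  # философия, Розанов
```

Under that assumption the bytes decode to readable Russian words, which suggests the page was saved in a legacy Cyrillic encoding while being labeled as UTF-8.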
Status: ASSIGNED → RESOLVED
Closed: 25 years ago → 25 years ago
Resolution: --- → INVALID