Bug 32976: Korean line breaking rules should be changed
Status: VERIFIED FIXED
Opened 25 years ago, closed 25 years ago
Categories: Core :: Layout, defect, P3
People: Reporter: jshin, Assigned: ftang
Attachments: 2 files
Although bug 27062 and bug 26734 mention applying the available CJK line breaking rules to mail/news message rendering, CJK line breaking rules do not seem to be in place EVEN for web page rendering. If they were in place, lines would be BROKEN at any ideographic boundary (in the case of CJ) and at any syllabic boundary (in the case of K) AS WELL AS at spaces. However, as of 2000-03-21, lines in Korean web pages are broken ONLY at spaces (just like in Latin text).
Comment 2 • Reporter • 25 years ago
Comment 3 • 25 years ago
Frank, are you familiar with the CJK line breaking stuff?
Assignee: erik → ftang
Comment 4 • Reporter • 25 years ago
Try to adjust the width of the browser window to maximize the difference between the case with <wbr> inserted at every syllable boundary and the case without <wbr>.
Comment 5 • Reporter • 25 years ago
Comment 6 • Assignee • 25 years ago
The current line break algorithm implements the JIS x4501 standard plus an approximate Thai breaking rule that was contributed from Thailand.
The difficulty in supporting a correct Korean line breaking rule is that there is NO formal spec that we can follow. The information you included is too abstract and not easy to understand. For example, you say "lines would be BROKEN at... syllabic boundaries (in case of K)". 1) Is there a standard that specifies that? 2) How do you define syllabic boundaries in terms of Unicode code points?
>the screenshot of NS 4.7(with incorrect line breaking). Mozilla does exactly the same.
Yes, the problem is that we implement what "we believe is correct". In other words, this is not an implementation problem but a design problem.
To correct the error, you have to teach us what is "correct" in your mind. Also, we have to be careful, because a change might introduce incompatibility with 4.x.
I am not quite sure what your perl script does. It looks like it adds a <wbr> after every character.
Do you mean we should treat Hangul the same way as CJK ideographs? In other words, do you mean U+AC00 - U+D7A3 should behave the same way as U+4E00-U+9FAF?
Status: NEW → ASSIGNED
Comment 7 • Assignee • 25 years ago
Changed the summary to "Korean line breaking rules should be changed".
Summary: CJK line breaking rules does NOT seem to be in place. → Korean line breaking rules should be changed
Comment 8 • Assignee • 25 years ago
Ok... I found a reference: Developing International Software For Windows 95 and Windows NT, Nadine Kano, Microsoft Press, ISBN-1-55615-840-8, p. 244, "Dividing Lines of Text in Korean":

"Korean words expressed in hangul are separated by spaces, as they are in Western languages. Some Korean-language applications allow the user to choose whether or not to break lines between hangul characters. This example breaks lines only between words.

HANGUL English HANGUL HANGUL

The example below breaks lines between individual hangul characters.

HANGUL English HANGUL HAN GUL

The standard rule for breaking lines between hangul characters, called geumchik, is very similar to the Japanese kinsoku rule: you can break lines between any two characters, with the following exceptions. A line of text cannot end with any leading characters. (Characters are shown with their hexadecimal code point for the Korean standard code, KSC 5601.) ....

A line of text cannot begin with any following characters, listed below: ...

The geumchik rule defines three methods for dealing with following characters. The first method, the JalLaNaeGi method, breaks the line before the first character to the left of the following character, as shown below:

THESE ARE HANGUL CHARACTER| S. |

The MilEoNuGi method breaks the line after the following character and compresses the text that falls before it, as shown below:

THESE ARE HANGUL CHARACTERS.|

The GeuNyangDuGi method extends the right margin slightly to accommodate the following character, as shown below:

THESE ARE HANGUL CHARACTERS|.

This method can also extend the bottom margin. There is no special category for overflow characters in Korean."

I cannot find anything about Korean line breaking in Ken Lunde's CJKV Information Processing.

jshin: Is "The example below breaks lines between individual hangul characters." in Nadine's book what you are asking for here? If your answer is yes, then the following patch should fix it for you. Can you build and try it?
Comment 9 • Assignee • 25 years ago
Z:\mozilla\intl\lwbrk\src>cvs diff -c nsJIS*.cpp
Index: nsJISx4501LineBreaker.cpp
===================================================================
RCS file: /m/pub/mozilla/intl/lwbrk/src/nsJISx4501LineBreaker.cpp,v
retrieving revision 1.20
diff -c -r1.20 nsJISx4501LineBreaker.cpp
*** nsJISx4501LineBreaker.cpp 2000/01/13 23:26:21 1.20
--- nsJISx4501LineBreaker.cpp 2000/03/23 18:17:03
***************
*** 232,237 ****
--- 232,238 ----
  {
  c = GETCLASSFROMTABLE(gLBClass30, l);
  } else if (( ( 0x3200 <= h) && ( h <= 0x9fff) ) || // Unicode 3.0
+ ( ( 0xAC00 <= h) && ( h <= 0xD7FF) ) || // Hangul
  ( ( 0xf900 <= h) && ( h <= 0xfaff) ) )
  {
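
For readers who want to try the range test outside the tree, here is a minimal standalone sketch of the condition the patch extends. The function name `IsIdeographLike` is hypothetical and is not part of nsJISx4501LineBreaker.cpp. The sketch also assumes the test is applied to a whole 16-bit code unit; how `h` is derived in the real file is not shown in the diff.

```cpp
#include <cstdint>

// Hypothetical helper (not the Mozilla function): true when the 16-bit value
// falls in one of the three ranges tested by the patched condition.
bool IsIdeographLike(std::uint16_t h) {
    return (0x3200 <= h && h <= 0x9FFF) ||  // range commented "Unicode 3.0" in the patch
           (0xAC00 <= h && h <= 0xD7FF) ||  // Hangul range added by the patch
           (0xF900 <= h && h <= 0xFAFF);    // third range already present before the patch
}
```

With this predicate, a precomposed Hangul syllable such as U+AC00 falls into the same branch as U+4E00, which is the behavior requested in this bug.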
Comment 10 • Reporter • 25 years ago
Absolutely!! Syllable boundaries are just Unicode code point boundaries as far as precomposed Hangul syllables are concerned. That is, U+AC00-U+D7A3 should be treated the same way as Hanja/Kanji/Hanzi. As for Hangul made up of U+1100 Jamos, details are available in the Unicode 3.0 book. As for Kano's book, just disregard the prohibition rules he mentioned for the moment. I don't know what Hangul syllables that he wrote cannot begin or end lines. Even if there are such characters, it's much more important to let Mozilla break between any Hangul syllables now and take care of them later. The only prohibition rules I can think of are NOT between Hangul syllables BUT about some punctuation marks (as implemented in a rudimentary way by my perl script).
Comment 11 • Reporter • 25 years ago
Kano's book is absolutely WRONG in saying "Some Korean-language applications allow the user to choose whether or not to break lines between hangul characters". NO SANE author of Korean word processors/typesetting programs would do that. As for introducing incompatibility with NS 4.x, it should be no concern, as it's just correcting what's been wrong in NS 4.x.
Comment 12 • Reporter • 25 years ago
I applied your patch and rebuilt. Now my sample page (whose URL is given above) renders exactly the same whether or not I insert <wbr> between every pair of syllables. Could you please check this in? I can assure you that this is the RIGHT way!!
Comment 13 • Assignee • 25 years ago
>I don't know what Hangul syllables that he wrote cannot begin or end lines.
You don't have Nadine's book, do you? The characters he listed are not Hangul but some ASCII symbols and some Korean symbols (in the single-byte range and *some* code points in the A1A1-A3FF range). Read http://msdn.microsoft.com/library/books/devintl/S24B6_L3.HTM for the online version of Nadine's section.
>Could you please check this in?
Will check in to the tip (not the beta1 branch, sorry) after the tree opens this afternoon.
Comment 14 • Reporter • 25 years ago
If you're still curious, what my perl script does is the following (as I wrote on the Unicode List). Because of the way Korean Hangul syllables are encoded in Unicode, Hangul syllable boundaries are just Unicode code point boundaries as far as precomposed Hangul syllables (UAC00-UDxxx) are concerned.
1) Lines can be broken at any syllabic boundary.
2) Lines can be broken at a space (this is arguably included in rule 1).
3) Do not end lines with a certain set of punctuation marks: opening single/double quotation marks, opening brace/bracket/parenthesis....
4) Do not begin lines with a certain set of punctuation marks: closing single/double quotation marks, closing brace/bracket/parenthesis, question mark, exclamation mark, semicolon, colon, period, comma....
Rules #3 and #4 correspond to the prohibition rules in Kano's book, I believe. Taking a second look at your excerpt of his book about prohibition rules, he doesn't seem to have written that there are some *Hangul* syllables that canNOT begin or end lines. His list of punctuation marks and similar characters (symbols...) that cannot begin or end lines (in his example, he's talking about '.' (period)) may well be more extensive than my list above, but the two lists should basically convey the same 'spirit'. Anyway, rules #3 and #4 are, I believe, already taken care of by Mozilla (not just for East Asian but also for Latin text), and your patch (thank you! I should have looked at the source) would fill in the last missing part, except for certain fine points which can be dealt with later.
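
The four rules above can be sketched roughly as follows. This is not the perl script and not Mozilla's line breaker: every function name is hypothetical, the punctuation sets are only an illustrative subset, and the sketch assumes that a boundary counts as "syllabic" when a precomposed Hangul syllable sits on either side of it and that a space stays at the end of the line.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Precomposed Hangul syllables, U+AC00..U+D7A3.
bool IsHangulSyllable(char16_t c) { return c >= 0xAC00 && c <= 0xD7A3; }

// Rule 3 (illustrative subset): characters that must not end a line.
bool CannotEndLine(char16_t c) {
    return c == u'(' || c == u'[' || c == u'{' || c == 0x2018 || c == 0x201C;
}

// Rule 4 (illustrative subset): characters that must not begin a line.
bool CannotBeginLine(char16_t c) {
    switch (c) {
        case u')': case u']': case u'}': case u'?': case u'!':
        case u';': case u':': case u'.': case u',':
        case 0x2019: case 0x201D:
            return true;
        default:
            return false;
    }
}

// Returns every index i such that a line may be broken between
// text[i - 1] and text[i].
std::vector<std::size_t> BreakOpportunities(const std::u16string& text) {
    std::vector<std::size_t> out;
    for (std::size_t i = 1; i < text.size(); ++i) {
        const char16_t prev = text[i - 1];
        const char16_t next = text[i];
        if (next == u' ') continue;  // assumption: keep a space at the end of the line
        if (CannotEndLine(prev) || CannotBeginLine(next)) continue;  // rules 3 and 4
        const bool atSpace = (prev == u' ');                         // rule 2
        const bool atSyllable =
            IsHangulSyllable(prev) || IsHangulSyllable(next);        // rule 1
        if (atSpace || atSyllable) out.push_back(i);
    }
    return out;
}
```

For a five-character string made of two syllables, a space, a syllable, and a period, this reports opportunities at indices 1 and 3 only: between the first two syllables and after the space, but never before the period.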
Comment 15 • Reporter • 25 years ago
>> I don't know what Hangul syllables that he wrote cannot begin or
>> end lines.
> You don't have Nadine's book, do you? The characters he listed are not
> Hangul but some ASCII symbols and some Korean symbols (in the single-byte range and
> *some* code points in the A1A1-A3FF range).
> Read http://msdn.microsoft.com/library/books/devintl/S24B6_L3.HTM
> for the online version of Nadine's section.
That's what I expected (read my last comment about my perl script, which crossed with your comment) and what I have been telling you all along. Those characters are basically the same characters that CANNOT end or begin lines in English text either (please note that most of them are just full-width versions of their US-ASCII counterparts). His list is not complete, in that it doesn't have '?' (US-ASCII) in the list of characters that cannot begin lines, while the full-width counterpart is included.
>> Could you please check this in?
> Will check in to the tip (not the beta1 branch, sorry) after the tree opens this
> afternoon.
Hey, come on. Your patch is 100% correct (as far as precomposed Hangul syllables are concerned), so please do not extend the life of the wrong behavior any longer. Well, it's up to you, but I'd check it in to the beta 1 branch as well as the tip.
Comment 16 • Assignee • 25 years ago
Fixed and checked in.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED