Closed Bug 32976 Opened 25 years ago Closed 25 years ago

Korean line breaking rules should be changed

Categories

(Core :: Layout, defect, P3)

VERIFIED FIXED

People

(Reporter: jshin, Assigned: ftang)

Attachments

(2 files)

Although bugs 27062 and 26734 mention applying the
CJK line breaking rules to mail/news
message rendering, the CJK line breaking rules
don't seem to be in place EVEN for web page rendering.
If they were in place, lines would be BROKEN at
any ideographic boundary (in the case of CJ) and
any syllabic boundary (in the case of K) AS WELL
AS at spaces. However,
as of 2000-03-21, lines in Korean web pages
are broken ONLY at spaces (just like in Latin text).
Reassigning to Erik
Assignee: troy → erik
Frank, are you familiar with the CJK line breaking stuff?
Assignee: erik → ftang
Try to adjust the width of the browser window to
maximize the difference between the case with
<wbr> inserted at every syllable boundary and
the case without <wbr>.
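
For readers who want to reproduce this test case, here is a minimal C++ sketch of building the "<wbr> at every syllable boundary" variant of a page's text. It is not the reporter's Perl script. The helper names `IsHangulSyllable` and `InsertWbrBetweenSyllables` are hypothetical, the input is assumed to be plain text with no markup, and the U+AC00-U+D7A3 range is the precomposed-syllable range quoted later in this thread.

```cpp
#include <cstddef>
#include <string>

// Precomposed Hangul syllables, using the U+AC00..U+D7A3 range cited in this bug.
constexpr bool IsHangulSyllable(char16_t c) {
  return c >= 0xAC00 && c <= 0xD7A3;
}

// Hypothetical helper: returns a copy of `text` with "<wbr>" inserted between
// every pair of adjacent precomposed Hangul syllables. Assumes plain text input;
// it does not parse or skip existing HTML tags.
std::u16string InsertWbrBetweenSyllables(const std::u16string& text) {
  std::u16string out;
  for (std::size_t i = 0; i < text.size(); ++i) {
    out.push_back(text[i]);
    const bool hasNext = i + 1 < text.size();
    if (hasNext && IsHangulSyllable(text[i]) && IsHangulSyllable(text[i + 1])) {
      out += u"<wbr>";
    }
  }
  return out;
}
```

If the layout engine already breaks between Hangul syllables, the page with and without the inserted <wbr> should wrap identically at any window width.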
The current line break algorithm implements the JIS X 4051 standard plus an approximate Thai 
breaking rule that was contributed from Thailand. 
The difficulty in supporting a correct Korean line breaking rule is that there is NO 
formal spec that we can follow. The information you included is too abstract and 
not easy to understand. For example, you say "lines would be BROKEN at... 
syllabic boundaries (in case of K)". 1) Is there a standard that specifies that? 2) 
How do you define syllabic boundaries in terms of Unicode code points?

>the screenshot of NS 4.7(with incorrect line breaking). Mozilla does exactly 
>the same.
Yes, the problem is that we implemented what "we believe is correct". In other words, 
this is not an implementation problem but a design problem. 
To correct the error, you have to educate us about what you consider "correct". 
Also, we have to be careful, since this might introduce incompatibility with 4.x.

I am not quite sure what your perl script does. It looks like it adds a <wbr> after 
every character. 

Do you mean we should treat Hangul the same way as CJK ideographs? In other words, 
do you mean U+AC00 - U+D7A3 should behave the same way as U+4E00-U+9FAF?
Status: NEW → ASSIGNED
Change the summary to "Korean line breaking rules should be changed"
Summary: CJK line breaking rules does NOT seem to be in place. → Korean line breaking rules should be changed
OK... I found a reference:
Developing International Software For Windows 95 and Windows NT, Nadine Kano, 
Microsoft Press, ISBN-1-55615-840-8, pp 244, Dividing Lines of Text in Korean:
"Korean words expressed in hangul are separated by spaces, as they are in 
Western languages. Some Korean-language applications allow the user to choose 
whether or not to break lines between hangul characters.
This example breaks lines only between words.
HANGUL English HANGUL
HANGUL

The example below breaks lines between individual hangul characters.

HANGUL English HANGUL HAN
GUL

The standard rule for breaking lines between hangul characters, called geumchik, 
is very similar to the Japanese kinsoku rule: you can break lines between any 
two characters, with the following exceptions. A line of text cannot end with 
any leading characters. (Characters are shown with their hexadecimal code points 
for the Korean standard code, KSC 5601.) 
....
A line of text cannot begin with any following characters, listed below:
...
The geumchik rule defines three methods for dealing with following characters, 
the first method, the JalLaNaeGi method, breaks the line before the first 
character to the left of the following character, as shown below:
THESE ARE HANGUL CHARACTER|
S.                        |

The MilEoNuGi method breaks the line after the following character and 
compresses the text that falls before it, as shown below:

THESE ARE HANGUL CHARACTERS.|

The GeuNyangDuGi method extends the right margin slightly to accommodate the 
following character, as shown below:
THESE ARE HANGUL CHARACTERS|.

This method can also extend the bottom margin. 
There is no special category for overflow characters in Korean. "

I cannot find anything about Korean line breaking in Ken Lunde's CJKV Information 
Processing.

jshin- Is the case described by "The example below breaks lines between individual hangul 
characters." in Nadine's book the one you are asking for here? If your answer is yes, 
then the following patch should fix it for you. Can you build and try it?
Z:\mozilla\intl\lwbrk\src>cvs diff -c nsJIS*.cpp
Index: nsJISx4501LineBreaker.cpp
===================================================================
RCS file: /m/pub/mozilla/intl/lwbrk/src/nsJISx4501LineBreaker.cpp,v
retrieving revision 1.20
diff -c -r1.20 nsJISx4501LineBreaker.cpp
*** nsJISx4501LineBreaker.cpp   2000/01/13 23:26:21     1.20
--- nsJISx4501LineBreaker.cpp   2000/03/23 18:17:03
***************
*** 232,237 ****
--- 232,238 ----
     {
       c = GETCLASSFROMTABLE(gLBClass30, l);
     } else if (( ( 0x3200 <= h) && ( h <= 0x9fff) ) || // Unicode 3.0
+               ( ( 0xAC00 <= h) && ( h <= 0xD7FF) ) || // Hangul
                ( ( 0xf900 <= h) && ( h <= 0xfaff) )
               )
     {
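
The condition the patch extends can be read in isolation as a range test. The sketch below restates it as a standalone C++ function. The name `BreaksLikeIdeograph` is hypothetical, and because the diff context does not show how `h` is derived from the character, the sketch compares the full UTF-16 code unit directly (an assumption). Note that the patch's upper bound, 0xD7FF, is slightly wider than the last precomposed syllable, U+D7A3, mentioned elsewhere in this bug.

```cpp
// Hypothetical standalone restatement of the range test touched by the patch.
// Assumption: the comparison is made on the whole UTF-16 code unit `u`;
// the patched file uses a variable `h` whose derivation is not shown in the diff.
constexpr bool BreaksLikeIdeograph(char16_t u) {
  return (u >= 0x3200 && u <= 0x9FFF) ||  // existing range, commented "Unicode 3.0" in the diff
         (u >= 0xAC00 && u <= 0xD7FF) ||  // range added by the patch for Hangul
         (u >= 0xF900 && u <= 0xFAFF);    // existing range
}
```

Under this reading, every precomposed Hangul syllable falls into the same branch as the ideographs, while Hangul Jamo starting at U+1100 does not.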
Absolutely !!. Syllable boundaries are just Unicode code point
boundaries as far as precomposed Hangul syllables are concerned.
That is, U+AC00-U+D7A3 should be treated the same way
as Hanja/Kanji/Hanzi. As for Hangul made up of U+1100 Jamos,
details are available in the Unicode 3.0 book.
As for Kano's book, just disregard the prohibition rules he mentioned for the moment.
I don't know what Hangul syllables that he wrote cannot begin or
end lines. Even if there are such characters, it's much more important
to let Mozilla break between any Hangul syllables now and take care
of them later. The only prohibition rules I can think of are NOT between
Hangul syllables BUT about some punctuation marks (as implemented
in a rudimentary way by my perl script).
Kano's book is absolutely WRONG in saying
 "Some Korean-language applications allow the user to choose
 whether or not to break line between Hangul characters".
NO SANE author of Korean word processors/typesetting programs
would do that.
As for introducing incompatibility with NS 4.x, it should be of no
concern as it's just correcting what's been wrong in NS 4.x.
I applied your patch and rebuilt. Now my sample page (the
URL of which is given above) renders exactly the same whether or
not I insert <wbr> between every pair of syllables.
Could you please check this in? I can assure you that this is
the RIGHT way !!
>I don't know what Hangul syllables that he wrote cannot begin or
>end lines.
You don't have Nadine's book, do you? The characters listed there are not 
Hangul but some ASCII symbols and some Korean symbols (in the single-byte range and 
*some* code points in the A1A1-A3FF range) 
Read http://msdn.microsoft.com/library/books/devintl/S24B6_L3.HTM 
for the online version of Nadine's section.
>Could you please check this in?
Will check in to the tip (not the beta1 branch, sorry) after the tree opens this 
afternoon.

    What my perl script does is the following(as I wrote
on the Unicode List) if you're still curious. Because of the way Korean
Hangul syllables are  encoded in Unicode, Hangul syllable boundaries are
just Unicode code point boundaries as far as precomposed Hangul syllables
(UAC00-UDxxx) are concerned.


  1) can be broken at any syllabic boundaries

  2) can be broken at space (this arguably is included in rule 1)

  3) Do not end lines with a certain set of punctuation marks
     : opening single/double quotation marks, opening
       brace/bracket/parenthesis....

  4) Do not begin lines with a certain set of punctuation marks
     : closing single/double quotation marks, closing
       brace/bracket/parenthesis, question mark, exclamation mark,
       semicolon, colon, period, comma....
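
The four rules above can be sketched in C++ roughly as follows. This is an illustration, not the Perl script itself or Mozilla's implementation. All function names are hypothetical, the punctuation sets are deliberately small and incomplete (the rules end in "...."), and treating a boundary between a Hangul syllable and any non-space neighbour as a candidate is an assumption about how rule 1 applies to mixed text.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Precomposed Hangul syllables (U+AC00..U+D7A3, the range given in this thread).
constexpr bool IsPrecomposedHangul(char16_t c) { return c >= 0xAC00 && c <= 0xD7A3; }

// Rule 3: marks that must not end a line (illustrative, incomplete set).
bool MustNotEndLine(char16_t c) {
  static const std::u16string kOpening = u"([{\u2018\u201C";
  return kOpening.find(c) != std::u16string::npos;
}

// Rule 4: marks that must not begin a line (illustrative, incomplete set).
bool MustNotBeginLine(char16_t c) {
  static const std::u16string kClosing = u")]}\u2019\u201D?!;:.,";
  return kClosing.find(c) != std::u16string::npos;
}

// Returns every index i such that a line may be broken between text[i-1] and text[i].
std::vector<std::size_t> BreakOpportunities(const std::u16string& text) {
  std::vector<std::size_t> breaks;
  for (std::size_t i = 1; i < text.size(); ++i) {
    const char16_t prev = text[i - 1];
    const char16_t next = text[i];
    // Rule 2: break after a run of spaces.
    const bool afterSpace = prev == u' ' && next != u' ';
    // Rule 1 (assumed reading): any boundary touching a Hangul syllable,
    // as long as neither side is a space.
    const bool syllableBoundary = prev != u' ' && next != u' ' &&
                                  (IsPrecomposedHangul(prev) || IsPrecomposedHangul(next));
    if (!afterSpace && !syllableBoundary) continue;
    // Rules 3 and 4 veto an otherwise valid opportunity.
    if (MustNotEndLine(prev) || MustNotBeginLine(next)) continue;
    breaks.push_back(i);
  }
  return breaks;
}
```

For a string such as "(한글) ab.", this yields one opportunity between the two syllables and one after the space, but none right after the opening parenthesis or right before the closing parenthesis and the period.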

  Rule #3 and #4 correspond to prohibition rules in Kano's book, I
believe. Taking a second look at your excerpt of his book about
prohibition rules, he doesn't seem to have written that there are
some *Hangul* syllables that canNOT begin or end lines. His list of
punctuation marks and similar(symbols...) that cannot begin or end lines
(in his example, he's talking about '.' (period)) may as well be more
extensive than my list above, but the two lists should basically convey the
same 'spirit'. Anyway, rule #3 and #4 are, I believe, already taken care
of by Mozilla (not just for East Asian but also for Latin text) and your
patch(thank you ! I should have looked at the source) would fill the last
missing part except for certain fine points which can be dealt with later.
>> I don't know what Hangul syllables that he wrote cannot begin or
>> end lines.
> You don't have Nadine's book, do you? The characters listed there are not
> Hangul but some ASCII symbols and some Korean symbols (in the single-byte range and
> *some* code points in the A1A1-A3FF range)
> Read http://msdn.microsoft.com/library/books/devintl/S24B6_L3.HTM
> for the online version of Nadine's section.

   That's what I expected (read my last comment about
my perl script, which crossed with your comment in the middle) and what I
have been telling you all along.  Those characters are basically the same
characters that CANNOT end or begin lines in English text either (please
note that most of them are just full-width versions of their US-ASCII
counterparts). His list is not complete in that it doesn't have '?' (US-ASCII)
in the list of characters that cannot begin lines, while the full-width
counterpart is included.


>> Could you please check this in?

> Will check in to the tip (not beta1 branch sorry) after the tree open this
> afternoon.

  Hey, come on. Your patch is 100% correct (as far as precomposed
Hangul syllables are concerned), so please do not extend the
life of the wrong behavior any longer. Well, it's up to you, but
I'd check it in on the beta 1 branch as well as on the tip.
Fixed and checked in.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Marking verified per last comments.
Status: RESOLVED → VERIFIED