Closed Bug 8275 Opened 25 years ago Closed 22 years ago

Need routine to perform Unicode composition and decomposition

Categories

(Core :: Internationalization, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla1.2alpha

People

(Reporter: ftang, Assigned: shanjian)

References

Details

(Keywords: intl)

Attachments

(2 files, 7 obsolete files)

I think we need this to support Vietnamese. I am not sure this is a high priority item.
Status: NEW → ASSIGNED
Target Milestone: M11
Mark M11 since we may not need it.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → LATER
I don't think we really need this for SeaMonkey. Mark it LATER
QA Contact: teruko → ftang
reopening and marking FUTURE...
Status: RESOLVED → REOPENED
Resolution: LATER → ---
Target Milestone: M11 → Future
Status: REOPENED → ASSIGNED
nhotta say he need this. can we steal it from ICU ?
Target Milestone: Future → ---
shanjian, can you implement a normalizer and compose / decompose code in mozilla? Maybe we can port the ICU code or write our own.
Assignee: ftang → shanjian
Status: ASSIGNED → NEW
Blocks: 12063
Summary: Need routne to perform Unicode composition and decomposition → Need routine to perform Unicode composition and decomposition
Blocks: 156362
Blocks: 157673
we need this for a lot of work in m1.2 1. idns implementation 2. turn on MacOS X sorting 3. Mac OS X file system code
Keywords: intl, nsbeta1+
Target Milestone: --- → mozilla1.2alpha
take a look at glib http://www.mit.edu/afs/sipb/project/mono/src/glib-2.0.3/glib/ for example http://www.mit.edu/afs/sipb/project/mono/src/glib-2.0.3/glib/gunicode.h /* Compute canonical ordering of a string in-place. This rearranges decomposed characters in the string according to their combining classes. See the Unicode manual for more information. */ void g_unicode_canonical_ordering (gunichar *string, gsize len); /* Compute canonical decomposition of a character. Returns g_malloc()d string of Unicode characters. RESULT_LEN is set to the resulting length of the string. */ gunichar *g_unicode_canonical_decomposition (gunichar ch, gsize *result_len); gchar *g_utf8_normalize (const gchar *str, gssize len, GNormalizeMode mode); http://www.mit.edu/afs/sipb/project/mono/src/glib-2.0.3/glib/gunidecomp.c
I investigate the code in jpnic, it is very clear and self contained. I plan to use that code for this bug.
Status: NEW → ASSIGNED
Attached patch patch (not include data file) (obsolete) (deleted) — Splinter Review
Attached patch patch (data file only) (obsolete) (deleted) — Splinter Review
frank, naoki, Could you give some feedback, especially to the interface design and module placement?
+ NS_IMETHOD NormalizeUnicode( nsUnicodeNormForm aNormForm, + const nsAString& aSrc, PRUnichar** aDest) = 0; How about using nsAString for aDest also? I think nsNormalization.cpp should only contain xpcom wrapper. Then the acutal functions to be located in a separate file and make it callable as a C function.
*** Bug 12063 has been marked as a duplicate of this bug. ***
naoki, could you review my patch?
Blocks: 112979
nsINormalization.h - 1) for the return value, NS_ERROR_UNORM_MOREOUTPUT and NS_SUCCESS_UNORM_NOTFOUND are those returned to the caller? 2) the output buffer has to be provided by the caller please add comment about how the caller can estimate the amount of memory Makefile.in - 1) we need .xml file change for Mac, I can do that later nsNormalization.cpp - 1) indentation has to be consistent (do we want space instead of tab?) 2) PRInt32 do_composition, PRInt32 compat used by mdn_normalize() and other function, can they be PRBool? 3) need comments for korean constants +#define SBase 0xac00 +#define LBase 0x1100 +#define VBase 0x1161 +#define TBase 0x11a7 +#define LCount 19 +#define VCount 21 +#define TCount 28 +#define SLast (SBase + LCount * VCount * TCount) 4) mdn_normalize(), at the begining of the function, input is converted from UTF-16 to UCS4 I think we want to use more descriptive name to indicate that is UCS4 instead of simply 'c'. + while (start != end) { + PRUint32 c; 5) decompose(), I think this is different from NS_ERROR_OUT_OF_MEMORY, it can assert and return NS_ERROR_FAILURE, or can the caller try again with a smaller string to avoid this? + if (wb->size > WORKBUF_SIZE_MAX) { + // "mdn__unormalize_form*: " "working buffer too large\n" + return (NS_ERROR_OUT_OF_MEMORY); 6) NS_NewNormalization(), is this needed?
nsINormalization.h - 1) for the return value, NS_ERROR_UNORM_MOREOUTPUT and NS_SUCCESS_UNORM_NOTFOUND are those returned to the caller? Yes. 2) the output buffer has to be provided by the caller please add comment about how the caller can estimate the amount of memory NO, the buffer is allocated by callee. This has been commented in nsINormalization.h. nsNormalization.cpp - 1) indentation has to be consistent (do we want space instead of tab?) Done. I make it consistent locally. Their tab is not changed. 2) PRInt32 do_composition, PRInt32 compat used by mdn_normalize() and other function, can they be PRBool? Done. 3) need comments for korean constants +#define SBase 0xac00 +#define LBase 0x1100 +#define VBase 0x1161 +#define TBase 0x11a7 +#define LCount 19 +#define VCount 21 +#define TCount 28 +#define SLast (SBase + LCount * VCount * TCount) Those values were taken from unicode book. I marked it so. 4) mdn_normalize(), at the begining of the function, input is converted from UTF-16 to UCS4 I think we want to use more descriptive name to indicate that is UCS4 instead of simply 'c'. + while (start != end) { + PRUint32 c; I would like to keep to existing code. This way it is much easier for us to sync with jpnic. Those are not my code. 5) decompose(), I think this is different from NS_ERROR_OUT_OF_MEMORY, it can assert and return NS_ERROR_FAILURE, or can the caller try again with a smaller string to avoid this? + if (wb->size > WORKBUF_SIZE_MAX) { + // "mdn__unormalize_form*: " "working buffer too large\n" + return (NS_ERROR_OUT_OF_MEMORY); Done. 6) NS_NewNormalization(), is this needed? Removed.
Attached patch update patch (code only) (obsolete) (deleted) — Splinter Review
Attachment #95123 - Attachment is obsolete: true
r=nhotta > 1) for the return value, NS_ERROR_UNORM_MOREOUTPUT and NS_SUCCESS_UNORM_NOTFOUND > are those returned to the caller? > Yes. Then please add comment about the errors in .idl file.
Naoki, > 1) for the return value, NS_ERROR_UNORM_MOREOUTPUT and NS_SUCCESS_UNORM_NOTFOUND > are those returned to the caller? The correct answer should be NO. The only possible error message caller can receive is NS_ERROR_FAILURE and NS_ERROR_OUT_OF_MEMORY. I put them into nsINormalization.h.
There are two file which do not exist anymore. RCS file: /cvsroot/mozilla/intl/unicharutil/src/Attic/nsUcharUtilModule.cpp RCS file: /cvsroot/mozilla/intl/unicharutil/src/Attic/makefile.win for the module file, do you need to change somewhere else?
Attachment #100929 - Attachment is obsolete: true
please add .xml and MANIFEST change for Mac build
Attached patch mac build patch (obsolete) (deleted) — Splinter Review
Attached patch combine 2 code patch. (obsolete) (deleted) — Splinter Review
Attachment #101157 - Attachment is obsolete: true
Attachment #101161 - Attachment is obsolete: true
Comment on attachment 101168 [details] [diff] [review] combine 2 code patch. r=nhotta
Attachment #101168 - Flags: review+
shanjian, please send a mail to staff@mozilla.org and ask for review of the license.
alec, could you sr=?
Attachment #101168 - Flags: needs-work+
Comment on attachment 101168 [details] [diff] [review] combine 2 code patch. nsINormalization is the wrong name for this interface. How about nsIUnicodeNormalizer or something? What exactly does it mean to normalize? It looks like we're converting from UTF-16 into something else but I'm not sure. Should this/can this be in IDL? looks like it can and should to me. IDL constants should be in CAPS_WITH_UNDERSCORES, and be verbose, like CANONICAL_DECOMPOSITION and so forth. "kNFKD" is not intuitive anyway. Why is the result of NormalizeUnicode() a PRUnichar** instead of an nsAString&? that would avoid extra allocations. you don't have to #include nsISupports to define a CID, so lets remove that from nsNormalizationCID.h I'm not sure about the license issue, I'm assuming you got it right. I need to spend more time looking at this patch, but those are some larger issues that need to be fixed so far. I'll continue to review later today under the assumption that this will get a better name, and the above issues will be addressed.
Attached patch update with alecf's suggestion (obsolete) (deleted) — Splinter Review
Attachment #101168 - Attachment is obsolete: true
Comment on attachment 101318 [details] [diff] [review] update with alecf's suggestion carry over review.
Attachment #101318 - Flags: review+
Attached patch update data to unicode 3.20 (deleted) — Splinter Review
Comment on attachment 102224 [details] [diff] [review] update data to unicode 3.20 you've got double {{ / }} in these giant tables. Isn't there a better way of declaring these tables so the results aren't so cryptic? can we get some comments to indicate what each table does?
Comment on attachment 101318 [details] [diff] [review] update with alecf's suggestion sorry for taking so long. we need WAY more comments here. What are IDX0? IDX_0? BITS1? How do these lookup tables work? Also, your spacing is way off - you're mixing tabs and spaces (there should be NO tabs in source files) constants should not be declared with %{C++ instead you should use interface constants. For instance, put "const long UNICODE_NFD = 1;" in your interface. Another question I have is, why does nsIUnicodeNormalizer have one method, which takes a "aNormForm" as a parameter, and then just does a switch() statement based on that? Instead, nsIUnicodeNormalizer should have 4 methods which call mdn_normalize with the right parameters each time. (i.e. NormalizeFromNFD, NormalizeFromNFC, etc With that change, I don't even think you'll need the UNICODE_NFD/etc #defines.
Attachment #101318 - Flags: needs-work+
>> you've got double {{ / }} in these giant tables. Isn't there a better way of >> declaring these tables so the results aren't so cryptic? can we get some >> comments to indicate what each table does? Those table were copied from jpnic source directly with very little modification. We want to keep as less change as possible because the table might be updated periodically, just as I did today. >>we need WAY more comments here. What are IDX0? IDX_0? BITS1? How do these >>lookup tables work? >>Also, your spacing is way off - you're mixing tabs and spaces (there should be >>NO tabs in source files) If you use the standard of mozilla source code, then I have to rewrite everything. Everyline start with tab was copied from their file, I tried to avoid any kind of modificatin if possible, include naming, comment, whatsoever. Just by looking at the naming of id and indentation, my code can be differenciated from jpnic original code. >>constants should not be declared with %{C++ >>instead you should use interface constants. For instance, put "const long >>UNICODE_NFD = 1;" in your interface. I can certainly make such change. >>Another question I have is, why does nsIUnicodeNormalizer have one method, >>which takes a "aNormForm" as a parameter, and then just does a switch() >>statement based on that? Instead, nsIUnicodeNormalizer should have 4 methods >>which call mdn_normalize with the right parameters each time. (i.e. >>NormalizeFromNFD, NormalizeFromNFC, etc >>With that change, I don't even think you'll need the UNICODE_NFD/etc #defines. This convention originated from unicode standard documentation and their sample implemenation. For all the implementation I investigated, they all use this convention. I personaly don't think one way or the other have big advantage.
the advantage is smaller, faster code. Consider this simplified example: void DoAction(int action) { switch (action) { case 1: doAction1(); break; case 2: doAction2(); break; case 3: doAction3(); break; } } but all the callers always hardcode the last parameter: DoAction(1); DoAction(2); wouldn't it be much simpler to simply leave out DoAction() altogether and instead call: doAction1(); doAction2(); The code is smaller and faster because you're skipping decisions that have already been made by the caller. The only reason you should be using a switch statement is when the value passed in is dynamic and the caller doesn't actually know which conversion needs to be done. However, I don't see the value being passed in anywhere.. I can only assume it isn't dynamic. As for the tabs/spaces, do you need to take diffs from future versions of the source? I'm concerned that future maintenance on this file will result in a mix of tabs and spaces.
Attachment #95124 - Attachment is obsolete: true
Attachment #101318 - Attachment is obsolete: true
>>As for the tabs/spaces, do you need to take diffs from future versions of the >>source? I'm concerned that future maintenance on this file will result in a >>mixof tabs and spaces. I think so. If they fixed certain bugs, we might want to populate to our code easily through applying the patch. Answer 2 of your questions. >>What exactly does it mean to normalize? It >>looks like we're converting from UTF-16 into something else but I'm not sure. We are converting from UTF-16 to UTF-16. In terms of encoding, there is no change. Certain character can be represented in unicode in more than one form. This practice is derive from legacy encoding support. If those different representation forms are not normalized, certain operation like string comparison does not make much sense. >>CANONICAL_DECOMPOSITION and so forth. "kNFKD" is not intuitive anyway. This NFC, NFK, ... have become the jargon in unicode world. This is very similar to word like UTF-16. People who use this function should understand what it means. Some people may not realize what NFKC (Compatibility Decomposition, followed by Canonical Composition) really is when they are using the form.
Comment on attachment 102252 [details] [diff] [review] interface change, each form now has its own function. alec, could you sr this? shanjian put a new patch and answered your questions. I am trying to land my IDN patch which depends on this bug.
Attachment #102252 - Flags: superreview?(alecf)
Comment on attachment 102252 [details] [diff] [review] interface change, each form now has its own function. sr=alecf - thanks for the cleanup, and documentation.
Attachment #102252 - Flags: superreview?(alecf) → superreview+
Shanjian, Naoki wants to land his stuff before 1.3b. Please check this in soon. Thx.
fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago22 years ago
Resolution: --- → FIXED
No longer blocks: 157673
Depends on: 180372
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: