Closed
Bug 8275
Opened 25 years ago
Closed 22 years ago
Need routine to perform Unicode composition and decomposition
Categories
(Core :: Internationalization, defect, P3)
Core
Internationalization
Tracking
()
RESOLVED
FIXED
mozilla1.2alpha
People
(Reporter: ftang, Assigned: shanjian)
References
Details
(Keywords: intl)
Attachments
(2 files, 7 obsolete files)
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
patch
|
alecf
:
superreview+
|
Details | Diff | Splinter Review |
I think we need this to support Vietnamese. I am not sure this is a high
priority item.
Reporter | ||
Updated•25 years ago
|
Status: NEW → ASSIGNED
Target Milestone: M11
Reporter | ||
Comment 1•25 years ago
|
||
Mark M11 since we may not need it.
Reporter | ||
Updated•25 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → LATER
Reporter | ||
Comment 2•25 years ago
|
||
I don't think we really need this for SeaMonkey. Mark it LATER
Comment 3•24 years ago
|
||
reopening and marking FUTURE...
Status: RESOLVED → REOPENED
Resolution: LATER → ---
Target Milestone: M11 → Future
Reporter | ||
Updated•24 years ago
|
Status: REOPENED → ASSIGNED
Reporter | ||
Comment 4•22 years ago
|
||
nhotta say he need this. can we steal it from ICU ?
Target Milestone: Future → ---
Reporter | ||
Comment 5•22 years ago
|
||
shanjian, can you implement a normalizer and compose / decompose code in
mozilla? Maybe we can port the ICU code or write our own.
Assignee: ftang → shanjian
Status: ASSIGNED → NEW
Updated•22 years ago
|
Blocks: 12063
Summary: Need routne to perform Unicode composition and decomposition → Need routine to perform Unicode composition and decomposition
Reporter | ||
Comment 6•22 years ago
|
||
we need this for a lot of work in m1.2
1. idns implementation
2. turn on MacOS X sorting
3. Mac OS X file system code
Reporter | ||
Comment 7•22 years ago
|
||
take a look at glib
http://www.mit.edu/afs/sipb/project/mono/src/glib-2.0.3/glib/
for example
http://www.mit.edu/afs/sipb/project/mono/src/glib-2.0.3/glib/gunicode.h
/* Compute canonical ordering of a string in-place. This rearranges
decomposed characters in the string according to their combining
classes. See the Unicode manual for more information. */
void g_unicode_canonical_ordering (gunichar *string,
gsize len);
/* Compute canonical decomposition of a character. Returns g_malloc()d
string of Unicode characters. RESULT_LEN is set to the resulting
length of the string. */
gunichar *g_unicode_canonical_decomposition (gunichar ch,
gsize *result_len);
gchar *g_utf8_normalize (const gchar *str,
gssize len,
GNormalizeMode mode);
http://www.mit.edu/afs/sipb/project/mono/src/glib-2.0.3/glib/gunidecomp.c
Assignee | ||
Comment 8•22 years ago
|
||
I investigate the code in jpnic, it is very clear and self contained. I plan to
use that code for this bug.
Status: NEW → ASSIGNED
Assignee | ||
Comment 9•22 years ago
|
||
Assignee | ||
Comment 10•22 years ago
|
||
Assignee | ||
Comment 11•22 years ago
|
||
frank, naoki,
Could you give some feedback, especially to the interface design and module
placement?
Comment 12•22 years ago
|
||
+ NS_IMETHOD NormalizeUnicode( nsUnicodeNormForm aNormForm,
+ const nsAString& aSrc, PRUnichar** aDest) = 0;
How about using nsAString for aDest also?
I think nsNormalization.cpp should only contain xpcom wrapper. Then the acutal
functions to be located in a separate file and make it callable as a C function.
Assignee | ||
Comment 13•22 years ago
|
||
*** Bug 12063 has been marked as a duplicate of this bug. ***
Assignee | ||
Comment 14•22 years ago
|
||
naoki, could you review my patch?
Comment 15•22 years ago
|
||
nsINormalization.h -
1) for the return value, NS_ERROR_UNORM_MOREOUTPUT and NS_SUCCESS_UNORM_NOTFOUND
are those returned to the caller?
2) the output buffer has to be provided by the caller
please add comment about how the caller can estimate the amount of memory
Makefile.in -
1) we need .xml file change for Mac, I can do that later
nsNormalization.cpp -
1) indentation has to be consistent (do we want space instead of tab?)
2) PRInt32 do_composition, PRInt32 compat used by mdn_normalize() and
other function, can they be PRBool?
3) need comments for korean constants
+#define SBase 0xac00
+#define LBase 0x1100
+#define VBase 0x1161
+#define TBase 0x11a7
+#define LCount 19
+#define VCount 21
+#define TCount 28
+#define SLast (SBase + LCount * VCount * TCount)
4) mdn_normalize(), at the begining of the function,
input is converted from UTF-16 to UCS4
I think we want to use more descriptive name to indicate that is UCS4
instead of simply 'c'.
+ while (start != end) {
+ PRUint32 c;
5) decompose(), I think this is different from NS_ERROR_OUT_OF_MEMORY,
it can assert and return NS_ERROR_FAILURE,
or can the caller try again with a smaller string to avoid this?
+ if (wb->size > WORKBUF_SIZE_MAX) {
+ // "mdn__unormalize_form*: " "working buffer too large\n"
+ return (NS_ERROR_OUT_OF_MEMORY);
6) NS_NewNormalization(), is this needed?
Assignee | ||
Comment 16•22 years ago
|
||
nsINormalization.h -
1) for the return value, NS_ERROR_UNORM_MOREOUTPUT and NS_SUCCESS_UNORM_NOTFOUND
are those returned to the caller?
Yes.
2) the output buffer has to be provided by the caller
please add comment about how the caller can estimate the amount of memory
NO, the buffer is allocated by callee. This has been commented in
nsINormalization.h.
nsNormalization.cpp -
1) indentation has to be consistent (do we want space instead of tab?)
Done. I make it consistent locally. Their tab is not changed.
2) PRInt32 do_composition, PRInt32 compat used by mdn_normalize() and
other function, can they be PRBool?
Done.
3) need comments for korean constants
+#define SBase 0xac00
+#define LBase 0x1100
+#define VBase 0x1161
+#define TBase 0x11a7
+#define LCount 19
+#define VCount 21
+#define TCount 28
+#define SLast (SBase + LCount * VCount * TCount)
Those values were taken from unicode book. I marked it so.
4) mdn_normalize(), at the begining of the function,
input is converted from UTF-16 to UCS4
I think we want to use more descriptive name to indicate that is UCS4
instead of simply 'c'.
+ while (start != end) {
+ PRUint32 c;
I would like to keep to existing code. This way it is much easier for us to sync
with jpnic. Those are not my code.
5) decompose(), I think this is different from NS_ERROR_OUT_OF_MEMORY,
it can assert and return NS_ERROR_FAILURE,
or can the caller try again with a smaller string to avoid this?
+ if (wb->size > WORKBUF_SIZE_MAX) {
+ // "mdn__unormalize_form*: " "working buffer too large\n"
+ return (NS_ERROR_OUT_OF_MEMORY);
Done.
6) NS_NewNormalization(), is this needed?
Removed.
Assignee | ||
Comment 17•22 years ago
|
||
Attachment #95123 -
Attachment is obsolete: true
Comment 18•22 years ago
|
||
r=nhotta
> 1) for the return value, NS_ERROR_UNORM_MOREOUTPUT and NS_SUCCESS_UNORM_NOTFOUND
> are those returned to the caller?
> Yes.
Then please add comment about the errors in .idl file.
Assignee | ||
Comment 19•22 years ago
|
||
Naoki,
> 1) for the return value, NS_ERROR_UNORM_MOREOUTPUT and NS_SUCCESS_UNORM_NOTFOUND
> are those returned to the caller?
The correct answer should be NO. The only possible error message caller can
receive is NS_ERROR_FAILURE and NS_ERROR_OUT_OF_MEMORY. I put them into
nsINormalization.h.
Comment 20•22 years ago
|
||
There are two file which do not exist anymore.
RCS file: /cvsroot/mozilla/intl/unicharutil/src/Attic/nsUcharUtilModule.cpp
RCS file: /cvsroot/mozilla/intl/unicharutil/src/Attic/makefile.win
for the module file, do you need to change somewhere else?
Assignee | ||
Comment 21•22 years ago
|
||
Attachment #100929 -
Attachment is obsolete: true
Comment 22•22 years ago
|
||
please add .xml and MANIFEST change for Mac build
Assignee | ||
Comment 23•22 years ago
|
||
Assignee | ||
Comment 24•22 years ago
|
||
Attachment #101157 -
Attachment is obsolete: true
Attachment #101161 -
Attachment is obsolete: true
Comment 25•22 years ago
|
||
Comment on attachment 101168 [details] [diff] [review]
combine 2 code patch.
r=nhotta
Attachment #101168 -
Flags: review+
Reporter | ||
Comment 26•22 years ago
|
||
shanjian, please send a mail to staff@mozilla.org and ask for review of the
license.
Assignee | ||
Comment 27•22 years ago
|
||
alec, could you sr=?
Updated•22 years ago
|
Attachment #101168 -
Flags: needs-work+
Comment 28•22 years ago
|
||
Comment on attachment 101168 [details] [diff] [review]
combine 2 code patch.
nsINormalization is the wrong name for this interface. How about
nsIUnicodeNormalizer or something? What exactly does it mean to normalize? It
looks like we're converting from UTF-16 into something else but I'm not sure.
Should this/can this be in IDL? looks like it can and should to me. IDL
constants should be in CAPS_WITH_UNDERSCORES, and be verbose, like
CANONICAL_DECOMPOSITION and so forth. "kNFKD" is not intuitive anyway.
Why is the result of NormalizeUnicode() a PRUnichar** instead of an nsAString&?
that would avoid extra allocations.
you don't have to #include nsISupports to define a CID, so lets remove that
from nsNormalizationCID.h
I'm not sure about the license issue, I'm assuming you got it right.
I need to spend more time looking at this patch, but those are some larger
issues that need to be fixed so far.
I'll continue to review later today under the assumption that this will get a
better name, and the above issues will be addressed.
Assignee | ||
Comment 29•22 years ago
|
||
Attachment #101168 -
Attachment is obsolete: true
Assignee | ||
Comment 30•22 years ago
|
||
Comment on attachment 101318 [details] [diff] [review]
update with alecf's suggestion
carry over review.
Attachment #101318 -
Flags: review+
Assignee | ||
Comment 31•22 years ago
|
||
Comment 32•22 years ago
|
||
Comment on attachment 102224 [details] [diff] [review]
update data to unicode 3.20
you've got double {{ / }} in these giant tables. Isn't there a better way of
declaring these tables so the results aren't so cryptic? can we get some
comments to indicate what each table does?
Comment 33•22 years ago
|
||
Comment on attachment 101318 [details] [diff] [review]
update with alecf's suggestion
sorry for taking so long.
we need WAY more comments here. What are IDX0? IDX_0? BITS1? How do these
lookup tables work?
Also, your spacing is way off - you're mixing tabs and spaces (there should be
NO tabs in source files)
constants should not be declared with %{C++
instead you should use interface constants. For instance, put "const long
UNICODE_NFD = 1;" in your interface.
Another question I have is, why does nsIUnicodeNormalizer have one method,
which takes a "aNormForm" as a parameter, and then just does a switch()
statement based on that? Instead, nsIUnicodeNormalizer should have 4 methods
which call mdn_normalize with the right parameters each time. (i.e.
NormalizeFromNFD, NormalizeFromNFC, etc
With that change, I don't even think you'll need the UNICODE_NFD/etc #defines.
Attachment #101318 -
Flags: needs-work+
Assignee | ||
Comment 34•22 years ago
|
||
>> you've got double {{ / }} in these giant tables. Isn't there a better way of
>> declaring these tables so the results aren't so cryptic? can we get some
>> comments to indicate what each table does?
Those table were copied from jpnic source directly with very little
modification. We want to keep as less change as possible because the table might
be updated periodically, just as I did today.
>>we need WAY more comments here. What are IDX0? IDX_0? BITS1? How do these
>>lookup tables work?
>>Also, your spacing is way off - you're mixing tabs and spaces (there should be
>>NO tabs in source files)
If you use the standard of mozilla source code, then I have to rewrite
everything. Everyline start with tab was copied from their file, I tried to
avoid any kind of modificatin if possible, include naming, comment, whatsoever.
Just by looking at the naming of id and indentation, my code can be
differenciated from jpnic original code.
>>constants should not be declared with %{C++
>>instead you should use interface constants. For instance, put "const long
>>UNICODE_NFD = 1;" in your interface.
I can certainly make such change.
>>Another question I have is, why does nsIUnicodeNormalizer have one method,
>>which takes a "aNormForm" as a parameter, and then just does a switch()
>>statement based on that? Instead, nsIUnicodeNormalizer should have 4 methods
>>which call mdn_normalize with the right parameters each time. (i.e.
>>NormalizeFromNFD, NormalizeFromNFC, etc
>>With that change, I don't even think you'll need the UNICODE_NFD/etc #defines.
This convention originated from unicode standard documentation and their sample
implemenation. For all the implementation I investigated, they all use this
convention. I personaly don't think one way or the other have big advantage.
Comment 35•22 years ago
|
||
the advantage is smaller, faster code. Consider this simplified example:
void DoAction(int action)
{
switch (action) {
case 1: doAction1(); break;
case 2: doAction2(); break;
case 3: doAction3(); break;
}
}
but all the callers always hardcode the last parameter:
DoAction(1);
DoAction(2);
wouldn't it be much simpler to simply leave out DoAction() altogether and
instead call:
doAction1();
doAction2();
The code is smaller and faster because you're skipping decisions that have
already been made by the caller. The only reason you should be using a switch
statement is when the value passed in is dynamic and the caller doesn't actually
know which conversion needs to be done. However, I don't see the value being
passed in anywhere.. I can only assume it isn't dynamic.
As for the tabs/spaces, do you need to take diffs from future versions of the
source? I'm concerned that future maintenance on this file will result in a mix
of tabs and spaces.
Assignee | ||
Comment 36•22 years ago
|
||
Attachment #95124 -
Attachment is obsolete: true
Attachment #101318 -
Attachment is obsolete: true
Assignee | ||
Comment 37•22 years ago
|
||
>>As for the tabs/spaces, do you need to take diffs from future versions of the
>>source? I'm concerned that future maintenance on this file will result in a
>>mixof tabs and spaces.
I think so. If they fixed certain bugs, we might want to populate to our code
easily through applying the patch.
Answer 2 of your questions.
>>What exactly does it mean to normalize? It
>>looks like we're converting from UTF-16 into something else but I'm not sure.
We are converting from UTF-16 to UTF-16. In terms of encoding, there is no
change. Certain character can be represented in unicode in more than one form.
This practice is derive from legacy encoding support. If those different
representation forms are not normalized, certain operation like string
comparison does not make much sense.
>>CANONICAL_DECOMPOSITION and so forth. "kNFKD" is not intuitive anyway.
This NFC, NFK, ... have become the jargon in unicode world. This is very
similar to word like UTF-16. People who use this function should understand
what it means. Some people may not realize what NFKC (Compatibility
Decomposition, followed by Canonical Composition) really is when they are
using the form.
Comment 38•22 years ago
|
||
Comment on attachment 102252 [details] [diff] [review]
interface change, each form now has its own function.
alec, could you sr this?
shanjian put a new patch and answered your questions.
I am trying to land my IDN patch which depends on this bug.
Attachment #102252 -
Flags: superreview?(alecf)
Comment 39•22 years ago
|
||
Comment on attachment 102252 [details] [diff] [review]
interface change, each form now has its own function.
sr=alecf - thanks for the cleanup, and documentation.
Attachment #102252 -
Flags: superreview?(alecf) → superreview+
Comment 40•22 years ago
|
||
Shanjian,
Naoki wants to land his stuff before 1.3b. Please check this in soon. Thx.
Assignee | ||
Comment 41•22 years ago
|
||
fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago → 22 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•