Closed
Bug 12063
Opened 25 years ago
Closed 22 years ago
Need Unicode Normalization process...
Categories
(Core :: Internationalization, defect, P3)
People
(Reporter: ftang, Assigned: shanjian)
Details
(Keywords: helpwanted)
Details unknown. Needed after Beta 1.
Reporter
Updated•25 years ago
Status: NEW → ASSIGNED
Target Milestone: M15
Reporter
Comment 1•25 years ago
Post Beta 1; marking M15. This should probably be coded inside unicharutil.
Reporter
Updated•25 years ago
Whiteboard: Help Wanted
Reporter
Updated•25 years ago
Target Milestone: M15 → M20
Reporter
Comment 2•25 years ago
Changing it to M20.
Updated•25 years ago
Keywords: helpwanted
Whiteboard: Help Wanted
Won't we need this for things like searching and comparison of Unicode data?
We need to make sure that the Unicode data we generate is normalized,
but I think we should be able to assume incoming data is already normalized.
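To illustrate why searching and comparison depend on normalization, here is a minimal sketch. It assumes a hypothetical helper, ToyComposeAcute, that knows only three base-letter plus U+0301 pairs; it is not Mozilla or ICU code, and real NFC also needs full decomposition, canonical reordering, and the complete composition data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not Mozilla or ICU code): composes a base letter
// followed by U+0301 COMBINING ACUTE ACCENT into its precomposed form,
// for a tiny hand-picked table only.
std::u16string ToyComposeAcute(const std::u16string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char16_t c = in[i];
        if (i + 1 < in.size() && in[i + 1] == 0x0301) {
            char16_t composed = 0;
            switch (c) {
                case u'a': composed = 0x00E1; break;  // a with acute
                case u'e': composed = 0x00E9; break;  // e with acute
                case u'o': composed = 0x00F3; break;  // o with acute
                default: break;
            }
            if (composed != 0) {
                out.push_back(composed);
                ++i;  // skip the combining mark that was just consumed
                continue;
            }
        }
        out.push_back(c);
    }
    return out;
}

// Two spellings of the same word: precomposed versus base + combining mark.
void DemoComparison() {
    const std::u16string precomposed = u"caf\u00E9";
    const std::u16string decomposed = u"cafe\u0301";
    assert(precomposed != decomposed);                   // raw comparison fails
    assert(ToyComposeAcute(decomposed) == precomposed);  // equal once composed
}
```

The two strings render identically but compare unequal code unit by code unit, which is the case a search or comparison routine would have to handle if incoming data were not already normalized.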
According to the W3 "Character Model for the World Wide Web", all the data
received should be normalized according to the Unicode Normalization Form C
(See: http://www.unicode.org/unicode/reports/tr15/tr15-17.html).
See http://www.w3.org/TR/1999/WD-charmod-19991129/#Normalization
The producer of text data MUST ensure that data is produced or sent out
in normalized form. For the purpose of W3C specifications and their
implementations, the producer of text data is the sender of the data in
the case of protocols. In the case of formats, it is the tool that
produces the data.
Normalization Test Suite:
http://www.unicode.org/unicode/reports/tr15/conformance/DraftTestSuite
For those of you who don't have password access to poke around until you
find the right file, the correct URLs are:
http://www.unicode.org/unicode/reports/tr15/conformance/DraftTestSuite.zip
http://www.unicode.org/unicode/reports/tr15/conformance/NormalizerTestSuite.txt
See http://www.macchiato.com/unicode/normalization_footprint.htm
Normalization Footprint Description
This document describes how much memory the different normalization forms
occupy at a minimum (e.g., with an implementation tuned for minimal
space consumption).
See also http://www.w3.org/TR/charmod
Character Model for the World Wide Web
http://www.w3.org/TR/charmod/#sec-Normalization
Section 4: Early Uniform Normalization
Note:
4.3 Responsibility for Normalization
Producers MUST produce text data in normalized form. For the purpose
of W3C specifications and their implementations, the producer of text
data is the sender of the data in the case of protocols and the tool
that produces the data in the case of formats.
Note: Implementers of producer software in the above sense are
encouraged to delegate normalization to their respective data
sources wherever possible. Examples of data sources are
operating systems, libraries, and keyboard drivers.
The recipients of text data MUST assume the data is normalized and
MUST NOT normalize it. Recipients which transcode text data from a
legacy encoding to a Unicode encoding form MUST use a
normalizing-transcoder
Comment 9•23 years ago
Normalization (checking) may become a requirement for XML 1.1:
<URL: http://www.w3.org/TR/2001/WD-xml11-20011213/#sec2.13 >.
Comment 10•22 years ago
Normalization Form KC (NFKC) is needed for internationalized domain name support.
http://www.ietf.org/internet-drafts/draft-hoffman-stringprep-03.txt
Normalization is included in ICU (http://oss.software.ibm.com/icu/).
Its data file is about 100 KB.
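As a rough illustration of what the compatibility part of Form KC does, the sketch below folds fullwidth ASCII variants (U+FF01..U+FF5E) to their ordinary ASCII counterparts, which is one of the mappings NFKC applies. The helper name ToyFoldFullwidthAscii is hypothetical, and the function covers only this one range; it is not a stringprep or ICU implementation.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: applies only the "wide" compatibility mappings for
// U+FF01..U+FF5E, which NFKC folds to U+0021..U+007E. Every other code unit
// is copied through unchanged.
std::u16string ToyFoldFullwidthAscii(const std::u16string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (char16_t c : in) {
        if (c >= 0xFF01 && c <= 0xFF5E) {
            // The fullwidth block sits at a fixed offset from ASCII.
            out.push_back(static_cast<char16_t>(c - 0xFEE0));
        } else {
            out.push_back(c);
        }
    }
    return out;
}

// A host label typed with fullwidth letters folds to plain ASCII.
void DemoFullwidthFold() {
    assert(ToyFoldFullwidthAscii(u"\uFF41\uFF42\uFF43.com") == u"abc.com");
}
```

Without this kind of folding, two host names that look alike to the user would be treated as different strings.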
Blocks: 112979
Comment 11•22 years ago
Interface proposal.
Open issues:
1) use byte count or char count for length arguments?
2) use UTF-16 or UTF-32?
3) should the caller or the callee allocate the output buffer?
4) should this belong to uconv or somewhere else?
5) can we use the ICU implementation?
#define NS_ERROR_UNORM_MOREOUTPUT \
  NS_ERROR_GENERATE_FAILURE(NS_ERROR_MODULE_UCONV, 0x51)

typedef enum {
  kNFD,   // Canonical Decomposition
  kNFC,   // Canonical Decomposition,
          // followed by Canonical Composition
  kNFKD,  // Compatibility Decomposition
  kNFKC   // Compatibility Decomposition,
          // followed by Canonical Composition
} nsUnicodeNormalizationForm;

/**
 * Normalize Unicode.
 *
 * @param aNormForm       [IN]  Normalization form.
 * @param aSrc            [IN]  A pointer to an input UTF-16 string.
 * @param aSrcLength      [IN]  Length of the input (in 16-bit units).
 * @param aDest           [OUT] A pointer to an output buffer supplied by the
 *                              caller.
 * @param aDestBuffLength [IN]  Length of the caller-supplied buffer
 *                              (in 16-bit units).
 * @param aDestLength     [OUT] Length of the normalized UTF-16 string
 *                              (in 16-bit units).
 * @return NS_OK for success,
 *         NS_ERROR_UNORM_MOREOUTPUT if the supplied output buffer is not
 *         large enough.
 */
nsresult NormalizeUnicode(nsUnicodeNormalizationForm aNormForm,
                          const PRUnichar *aSrc, PRUint32 aSrcLength,
                          PRUnichar *aDest, PRUint32 aDestBuffLength,
                          PRUint32 *aDestLength);
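A caller of an interface shaped like this would probably need a retry loop for the "more output" case. The sketch below uses stand-in names (Unichar, Result, ToyDecompose, DecomposeWithRetry), none of which are Mozilla APIs, and a toy routine that decomposes only U+00E9. It also assumes that the required length is reported through the out parameter even on failure, which the proposal does not specify.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Stand-ins for the Mozilla types and error codes; names and values are
// illustrative only.
using Unichar = char16_t;
enum Result { kOk, kMoreOutput };

// Toy stand-in for the proposed function: "decomposes" only U+00E9 into
// U+0065 U+0301. All lengths count 16-bit units, as in the proposal.
Result ToyDecompose(const Unichar* src, std::uint32_t srcLength,
                    Unichar* dest, std::uint32_t destBuffLength,
                    std::uint32_t* destLength) {
    std::uint32_t needed = 0;
    for (std::uint32_t i = 0; i < srcLength; ++i) {
        needed += (src[i] == 0x00E9) ? 2 : 1;
    }
    *destLength = needed;  // assumption: required size is always reported
    if (needed > destBuffLength) {
        return kMoreOutput;
    }
    std::uint32_t o = 0;
    for (std::uint32_t i = 0; i < srcLength; ++i) {
        if (src[i] == 0x00E9) {
            dest[o++] = 0x0065;
            dest[o++] = 0x0301;
        } else {
            dest[o++] = src[i];
        }
    }
    return kOk;
}

// Caller side: start with a buffer the size of the input and grow it once
// if the callee asks for more room.
std::u16string DecomposeWithRetry(const std::u16string& in) {
    std::vector<Unichar> buf(in.size());
    std::uint32_t outLen = 0;
    Result rv = ToyDecompose(in.data(), static_cast<std::uint32_t>(in.size()),
                             buf.data(), static_cast<std::uint32_t>(buf.size()),
                             &outLen);
    if (rv == kMoreOutput) {
        buf.resize(outLen);
        rv = ToyDecompose(in.data(), static_cast<std::uint32_t>(in.size()),
                          buf.data(), static_cast<std::uint32_t>(buf.size()),
                          &outLen);
    }
    assert(rv == kOk);
    return std::u16string(buf.begin(), buf.begin() + outLen);
}
```

If the callee allocated the output instead (open issue 3), the retry would disappear, at the cost of an ownership rule for the returned buffer.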
Reporter
Updated•22 years ago
Target Milestone: Future → ---
Reporter
Comment 12•22 years ago
shanjian, can you implement a normalizer and compose / decompose code in
mozilla? Maybe we can port the ICU code or write our own.
Assignee: ftang → shanjian
Status: ASSIGNED → NEW
Assignee
Updated•22 years ago
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → DUPLICATE
Assignee
Comment 13•22 years ago
*** This bug has been marked as a duplicate of 8275 ***