Closed Bug 12063 Opened 25 years ago Closed 22 years ago

Need Unicode Normalization process...

Categories

(Core :: Internationalization, defect, P3)

Tracking

()

VERIFIED DUPLICATE of bug 8275

People

(Reporter: ftang, Assigned: shanjian)

References

()

Details

(Keywords: helpwanted)

Details unknown. We need it after Beta 1.
Status: NEW → ASSIGNED
Target Milestone: M15
Post Beta 1, marking M15. This should probably be coded inside unicharutil.
Whiteboard: Help Wanted
Target Milestone: M15 → M20
Changing it to M20.
Keywords: helpwanted
Whiteboard: Help Wanted
Won't we need this for things like searching and comparison of Unicode data?
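To illustrate the comparison concern: the precomposed and decomposed spellings of the same text are different UTF-16 sequences, so a plain string compare or search treats them as unequal. The sketch below is only an illustration. ComposeKnownPairs and its three-entry table are hypothetical names invented here, not Mozilla or ICU code, and the helper is far from a full NFC implementation (no canonical reordering, no Hangul, no full Unicode data).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical table for illustration only: three base + combining mark pairs
// and their precomposed forms. Real NFC needs the full Unicode data.
struct ComposePair {
  char16_t base;
  char16_t mark;
  char16_t composed;
};

static const ComposePair kPairs[] = {
    {u'e', 0x0301, 0x00E9},  // e + COMBINING ACUTE ACCENT
    {u'a', 0x0300, 0x00E0},  // a + COMBINING GRAVE ACCENT
    {u'u', 0x0308, 0x00FC},  // u + COMBINING DIAERESIS
};

// Replaces each adjacent (base, mark) pair found in kPairs with its
// precomposed character and copies everything else through unchanged.
std::u16string ComposeKnownPairs(const std::u16string& in) {
  std::u16string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    bool merged = false;
    if (i + 1 < in.size()) {
      for (const ComposePair& p : kPairs) {
        if (in[i] == p.base && in[i + 1] == p.mark) {
          out.push_back(p.composed);
          ++i;  // skip the combining mark that was just consumed
          merged = true;
          break;
        }
      }
    }
    if (!merged) out.push_back(in[i]);
  }
  return out;
}
```

With this helper, u"cafe\u0301" and u"caf\u00E9" compare unequal as raw strings, but compare equal once the first is passed through ComposeKnownPairs.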
QA Contact: teruko → ftang
We need to make sure that the Unicode data we generate is normalized, but I think we should be able to assume incoming data is already normalized. According to the W3C "Character Model for the World Wide Web", all data received should be normalized according to Unicode Normalization Form C (see http://www.unicode.org/unicode/reports/tr15/tr15-17.html).
See http://www.w3.org/TR/1999/WD-charmod-19991129/#Normalization
The producer of text data MUST ensure that data is produced or sent out in normalized form. For the purpose of W3C specifications and their implementations, the producer of text data is the sender of the data in the case of protocols. In the case of formats, it is the tool that produces the data.
Marking as Future.
Target Milestone: M20 → Future
See http://www.macchiato.com/unicode/normalization_footprint.htm
Normalization Footprint
Description: This document describes how much memory the different normalization forms occupy at a minimum (e.g., with an implementation tuned for minimal space consumption).
See also:
http://www.w3.org/TR/charmod Character Model for the World Wide Web
http://www.w3.org/TR/charmod/#sec-Normalization Section 4: Early Uniform Normalization
Note: 4.3 Responsibility for Normalization
Producers MUST produce text data in normalized form. For the purpose of W3C specifications and their implementations, the producer of text data is the sender of the data in the case of protocols and the tool that produces the data in the case of formats.
Note: Implementers of producer software in the above sense are encouraged to delegate normalization to their respective data sources wherever possible. Examples of data sources are operating systems, libraries, and keyboard drivers.
The recipients of text data MUST assume the data is normalized and MUST NOT normalize it.
Recipients which transcode text data from a legacy encoding to a Unicode encoding form MUST use a normalizing-transcoder
Normalization (checking) may become a requirement for XML 1.1: <URL: http://www.w3.org/TR/2001/WD-xml11-20011213/#sec2.13 >.
Normalization Form KC is needed for international domain name support. See http://www.ietf.org/internet-drafts/draft-hoffman-stringprep-03.txt
Normalization is included in ICU (http://oss.software.ibm.com/icu/). It uses a data file of about 100 KB.
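As a rough illustration of why the compatibility forms matter for domain names: Form KC folds compatibility variants such as the fullwidth ASCII forms (U+FF01 to U+FF5E) onto their ordinary ASCII counterparts, so two visually similar spellings of a name end up as the same string. The sketch below covers that one family of mappings only. FoldFullwidthAscii is a hypothetical name made up for this example; it is not a complete NFKC implementation and not what ICU or stringprep actually ship.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: maps fullwidth ASCII variants (U+FF01..U+FF5E) to
// U+0021..U+007E, which is the compatibility mapping NFKC applies to them.
// Every other code unit is copied through untouched.
std::u16string FoldFullwidthAscii(const std::u16string& in) {
  std::u16string out;
  out.reserve(in.size());
  for (char16_t c : in) {
    if (c >= 0xFF01 && c <= 0xFF5E) {
      // The fullwidth block is laid out in ASCII order at a fixed offset.
      out.push_back(static_cast<char16_t>(c - 0xFEE0));
    } else {
      out.push_back(c);
    }
  }
  return out;
}
```

For example, the fullwidth letters U+FF45 U+FF58 fold to the ASCII string "ex", while text that is already ASCII passes through unchanged.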
Blocks: 112979
Interface proposal.
Open issues:
1) Use byte count or char count for length arguments?
2) Use UTF-16 or UTF-32?
3) Should the caller or the callee allocate the out buffer?
4) Should this belong to uconv or somewhere else?
5) Can we use the ICU implementation?

#define NS_ERROR_UNORM_MOREOUTPUT \
  NS_ERROR_GENERATE_FAILURE(NS_ERROR_MODULE_UCONV, 0x51)

typedef enum {
  kNFD,   // Canonical Decomposition
  kNFC,   // Canonical Decomposition,
          // followed by Canonical Composition
  kNFKD,  // Compatibility Decomposition
  kNFKC   // Compatibility Decomposition,
          // followed by Canonical Composition
} nsUnicodeNorilizationForm;

/**
 * Normalize Unicode.
 *
 * @param aNormForm       [IN]  Normalization form.
 * @param aSrc            [IN]  A pointer to an input UTF-16 string.
 * @param aSrcLength      [IN]  The length of the input (in 16-bit units).
 * @param aDest           [OUT] A pointer to an output buffer supplied by the caller.
 * @param aDestBuffLength [IN]  The length of the caller-supplied buffer (in 16-bit units).
 * @param aDestLength     [OUT] The length of the normalized UTF-16 string (in 16-bit units).
 * @return NS_OK for success,
 *         NS_ERROR_UNORM_MOREOUTPUT if the supplied out buffer is not large enough.
 */
nsresult NormilizeUnicode(nsUnicodeNorilizationForm aNormForm,
                          const PRUnichar *aSrc,
                          PRUint32 aSrcLength,
                          PRUnichar *aDest,
                          PRUint32 aDestBuffLength,
                          PRUint32 *aDestLength);
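For open issue 3, here is a rough sketch of how a caller might use a caller-allocated buffer together with NS_ERROR_UNORM_MOREOUTPUT. Everything in it is a stand-in so that it compiles outside the tree: the typedefs and error values are placeholders, StubDecompose is a hypothetical function that only knows how to decompose U+00E9, and the form argument is left out for brevity. It also assumes the required length is written to aDestLength even when the call fails, which the proposal does not specify.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-ins so the sketch builds outside the Mozilla tree.
typedef uint32_t nsresult;
typedef uint16_t PRUnichar;
typedef uint32_t PRUint32;
const nsresult NS_OK = 0;
const nsresult NS_ERROR_UNORM_MOREOUTPUT = 1;  // placeholder, not the real code

// Hypothetical stub: decomposes only U+00E9 into U+0065 U+0301 and copies
// every other code unit through unchanged.
nsresult StubDecompose(const PRUnichar* aSrc, PRUint32 aSrcLength,
                       PRUnichar* aDest, PRUint32 aDestBuffLength,
                       PRUint32* aDestLength) {
  // First pass: count the 16-bit units the output needs.
  PRUint32 needed = 0;
  for (PRUint32 i = 0; i < aSrcLength; ++i) {
    needed += (aSrc[i] == 0x00E9) ? 2 : 1;
  }
  *aDestLength = needed;  // assumption: reported even on failure
  if (needed > aDestBuffLength) {
    return NS_ERROR_UNORM_MOREOUTPUT;
  }
  // Second pass: write the output.
  PRUint32 o = 0;
  for (PRUint32 i = 0; i < aSrcLength; ++i) {
    if (aSrc[i] == 0x00E9) {
      aDest[o++] = 0x0065;
      aDest[o++] = 0x0301;
    } else {
      aDest[o++] = aSrc[i];
    }
  }
  return NS_OK;
}

// Caller side: start with a buffer the size of the input and grow it once
// if the callee reports that more room is needed.
std::vector<PRUnichar> DecomposeWithRetry(const std::vector<PRUnichar>& aSrc) {
  std::vector<PRUnichar> dest(aSrc.size());
  PRUint32 len = 0;
  nsresult rv = StubDecompose(aSrc.data(), static_cast<PRUint32>(aSrc.size()),
                              dest.data(), static_cast<PRUint32>(dest.size()),
                              &len);
  if (rv == NS_ERROR_UNORM_MOREOUTPUT) {
    dest.resize(len);
    rv = StubDecompose(aSrc.data(), static_cast<PRUint32>(aSrc.size()),
                       dest.data(), static_cast<PRUint32>(dest.size()), &len);
  }
  assert(rv == NS_OK);
  dest.resize(len);  // trim to the actual output length
  return dest;
}
```

The retry costs a second pass over the input whenever the first guess is too small. Having the callee allocate would avoid that, at the price of an ownership convention for the returned buffer, which is the trade-off behind open issue 3.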
Target Milestone: Future → ---
shanjian, can you implement a normalizer and compose / decompose code in mozilla? Maybe we can port the ICU code or write our own.
Assignee: ftang → shanjian
Status: ASSIGNED → NEW
Depends on: 8275
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → DUPLICATE
*** This bug has been marked as a duplicate of 8275 ***
No longer blocks: 112979
verified duplicate
Status: RESOLVED → VERIFIED