Unicode Collation Algorithm
 

UCA Auxiliary Files

Version 6.0.0
2010-10-27

The files in this directory provide remapping and tailoring data for the UCA Version 6.0.0 DUCET weights, for use with CLDR. These files are large, and thus packaged in zip format to save download time.

CLDR Tailoring

As of version 1.9, CLDR uses a tailored UCA DUCET in the root locale, which all other locales inherit by default. However, root also contains a separate collation tailoring, with the keyword “ducet”, that tailors the modified DUCET back to the original. Using that keyword, the locale ID “und-u-co-ducet” allows access to the original DUCET.

The root locale ordering is tailored in the following ways:

Reordering of Common characters. The DUCET ordering puts characters into roughly the following ordering:

The CLDR root locale tailoring orders the common characters strictly by category: spaces, punctuation, general symbols, currency symbols, and numbers.

The relative order within each of these groups still matches the DUCET. Symbols, punctuation, and numbers that are grouped with a particular script stay with that script. The only two exceptions are two currency symbols that are moved up to be with the other currency symbols:

This regrouping only matters in comparison where a common character in one group is compared to a common character in another, such as if “I♥NY” were compared to “I-NY”, where a symbol is compared to a punctuation mark. What the regrouping allows is for users to parametrically reorder the groups. For example, users can reorder numbers after all scripts, or reorder Greek before Latin.

Symbols non-variable. There are two options in the UCA for symbols and punctuation: non-ignorable, or shifted. With the shifted option, almost all symbols and punctuation are ignored -- except at a fourth level. The root locale ordering is modified so that symbols are not affected by the shifted option. So shifted only causes whitespace and punctuation to be ignored, but not symbols (like ♥). The old behavior can be specified with a locale ID using the "vt" keyword, to set the Variable section to include all of the symbols below it, or be set parametrically where implementations allow access. See also:

Tailored noncharacter weights.

The code point U+FFFF is tailored to have a weight higher than all other characters. This allows reliable specification of a range, such as “Sch” ≤ X ≤ “Sch\uFFFF” to include all strings starting with "sch" or equivalent.
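The range idiom can be sketched as follows. This is only an illustration: `toy_key` is a hypothetical stand-in for a real collation key (it folds case and compares by code point), and `HIGHEST` is an assumed weight chosen to sit above every code point, mimicking the tailored U+FFFF.

```python
# Toy illustration only: a real implementation would use CLDR root collation keys.
HIGHEST = 0x110000  # assumed weight for U+FFFF, above every Unicode code point

def toy_key(s):
    """Hypothetical primary key: case-folded code points, with U+FFFF sorted last."""
    return tuple(HIGHEST if ch == "\uFFFF" else ord(ch) for ch in s.casefold())

def in_prefix_range(prefix, candidate):
    """True when prefix <= candidate <= prefix + U+FFFF under toy_key."""
    low = toy_key(prefix)
    high = toy_key(prefix + "\uFFFF")
    return low <= toy_key(candidate) <= high

names = ["Sca", "Sch", "schmidt", "Schule", "SCH\U0001F600", "Sci"]
matches = [n for n in names if in_prefix_range("Sch", n)]
```

Because U+FFFF outranks everything in this toy key, even a supplementary character after the prefix stays inside the range, which would not hold under plain code point comparison.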

The code point U+FFFE is tailored to have a weight lower than all other characters. This allows for interleaved levels within code point space.

In CLDR, these values are not further tailorable, and nothing can tailor to them. That is, neither can occur in a collation rule: for example, the following rules are illegal:

& \uFFFF < x

& x <\uFFFF

File Formats

The file formats may change between versions of the UCA. The formats for Version 6.0 are as follows. As usual, text after a # is a comment.

FractionalUCA_summary.txt

The lines are all comments, giving an overview of the weight structure. Lines are tab delimited, for use with spreadsheets. Most are in the format:

# position fractional-weight Script General-category Codepoint Name

where position is "First" or "Last"

Example

# First 03 05 Zyyy Cc U+0009 <CHARACTER TABULATION>
# Last 03 15 Zyyy Zp U+2029 PARAGRAPH SEPARATOR

The topmost variable weight is marked with the following format:

[variable top = 0C 32 04] # END OF VARIABLE SECTION!!!
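A minimal sketch of reading the variable-top marker, assuming the line has exactly the shape shown above. The helper names are hypothetical, and `is_variable` encodes the assumption that a non-empty primary at or below the variable top counts as variable.

```python
import re

# Matches the marker line shown above; the byte list is space-separated hex.
VARIABLE_TOP_RE = re.compile(r"\[variable top = ([0-9A-Fa-f ]+)\]")

def parse_variable_top(line):
    """Return the variable-top primary as bytes, or None if the line does not match."""
    match = VARIABLE_TOP_RE.search(line)
    return bytes.fromhex(match.group(1)) if match else None

def is_variable(primary, variable_top):
    """Assumption: a non-empty primary at or below the variable top is variable."""
    return bool(primary) and primary <= variable_top

top = parse_variable_top("[variable top = 0C 32 04] # END OF VARIABLE SECTION!!!")
```

With this value, a primary such as `03 05` falls in the variable section, while the single-byte primary `14` does not.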

The purpose of other lines is given by comments.

FractionalUCA.txt

The format is illustrated by the following sample lines, each followed by commentary.

[UCA version = 6.0.0]

The version number

0000; [,,] # Zyyy Cc [0000.0000.0000] * <NULL>

Provides a weight line. The first element (before the ";") is a hex codepoint sequence. The second field is a sequence of collation elements. Each collation element has 3 parts separated by commas: the primary weight, secondary weight, and tertiary weight. A weight is either empty (meaning a zero or ignorable weight) or is a sequence of one or more bytes. The bytes are interpreted as a "fraction", meaning that the ordering is 04 < 05 05 < 06. The weights are constructed so that no weight is an initial subsequence of another: that is, having both the weights 05 and 05 05 is illegal. The above line consists of all ignorable weights.
...
0009; [03 05, 05, 05] # Zyyy Cc [0100.0020.0002] * <CHARACTER TABULATION>
...
1B60; [06 14 0C, 05, 05] # Bali Po [0111.0020.0002] * BALINESE PAMENENG
...
0031; [14, 05, 05] # Zyyy Nd [149B.0020.0002] * DIGIT ONE

Single-byte weights are given to particularly frequent characters, such as space, digits, and a-z. Most characters are given two-byte weights, while relatively infrequent characters are given three-byte weights. The assignment of 2 vs 3 bytes does not reflect importance, or exact frequency.
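The weight lines above can be read with a short parser. The following is a sketch under the assumption that every weight line looks like the samples shown; `parse_weight_line` and `is_prefix_free` are hypothetical helper names. Representing each weight as a Python `bytes` value gives the "fraction" ordering directly, since `bytes` compare lexicographically.

```python
import re

def parse_weight_line(line):
    """Split a weight line into (code points, list of (primary, secondary, tertiary))."""
    data = line.split("#", 1)[0]  # text after '#' is a comment
    codepoint_field, element_field = data.split(";", 1)
    codepoints = tuple(int(cp, 16) for cp in codepoint_field.split())
    elements = []
    for body in re.findall(r"\[([^\]]*)\]", element_field):
        # An empty part is an ignorable weight and becomes b"".
        elements.append(tuple(bytes.fromhex(part.strip()) for part in body.split(",")))
    return codepoints, elements

def is_prefix_free(weights):
    """Check that no non-empty weight is an initial subsequence of another."""
    ordered = sorted({w for w in weights if w})
    # After sorting, a weight that is a prefix of any other is a prefix of its successor.
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))

codepoints, elements = parse_weight_line(
    "1B60; [06 14 0C, 05, 05] # Bali Po [0111.0020.0002] * BALINESE PAMENENG")
```

Here `elements[0][0]` is the three-byte primary `06 14 0C`, and `is_prefix_free` rejects a set that contains both `05` and `05 05`.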

# SPECIAL MAX/MIN COLLATION ELEMENTS

FFFE; [02, 02, 02] # Special LOWEST primary, for merge/interleaving
FFFF; [EF FE, 05, 05] # Special HIGHEST primary, for ranges

The two tailored noncharacters have their own weights.

# SPECIAL FINAL VALUES for Script Reordering

FDD0 0042; [05 FE, 05, 05] # Special final value for reordering token
FDD0 0043; [0C FE, 05, 05] # Special final value for reordering token

There are special values assigned to code point sequences FDD0+X. These sequences are simply used to communicate special values, and can be eliminated. For the reordering values, the purpose is to make sure that there is a "high" weight at the end of each reordering group.
...
# HOMELESS COLLATION ELEMENTS
FDD0 0063; [, 97, 3D] # [15E4.0020.0004] [1844.0020.0004] [0000.0041.001F] * U+01C6 LATIN SMALL LETTER DZ WITH CARON
FDD0 0064; [, A7, 09] # [15D1.0020.0004] [0000.0056.0004] * U+1DD7 COMBINING LATIN SMALL LETTER C CEDILLA
FDD0 0065; [, B1, 09] # [1644.0020.0004] [0000.0061.0004] * U+A7A1 LATIN SMALL LETTER G WITH OBLIQUE STROKE

The DUCET has some weights that don't correspond directly to a character. So that implementations can associate a character with each weight (necessary for certain implementations of tailoring), special sequences are constructed for those weights.

# VALUES BASED ON UCA
...
[first regular [0D 0A, 05, 05]] # U+0060 GRAVE ACCENT
[last regular [7A FE, 05, 05]] # U+1342E EGYPTIAN HIEROGLYPH AA032
[first implicit [E0 04 06, 05, 05]] # CONSTRUCTED
[last implicit [E4 DF 7E 20, 05, 05]] # CONSTRUCTED
[first trailing [E5, 05, 05]] # CONSTRUCTED
[last trailing [E5, 05, 05]] # CONSTRUCTED
...

The above table summarizes ranges of important groups of characters for implementations.
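One way an implementation might load these boundary lines is sketched below. It assumes single-word group names as in the samples shown (lines elided by "..." may use other shapes that this pattern does not cover), and the helper names are hypothetical.

```python
import re

# Covers lines of the form "[first regular [0D 0A, 05, 05]]" only.
BOUNDARY_RE = re.compile(r"\[(first|last) (\w+) \[([^\]]*)\]\]")

def parse_boundaries(lines):
    """Map (position, group) to a (primary, secondary, tertiary) tuple of bytes."""
    boundaries = {}
    for line in lines:
        match = BOUNDARY_RE.search(line.split("#", 1)[0])
        if match:
            position, group, body = match.groups()
            boundaries[(position, group)] = tuple(
                bytes.fromhex(part.strip()) for part in body.split(","))
    return boundaries

sample = [
    "[first regular [0D 0A, 05, 05]] # U+0060 GRAVE ACCENT",
    "[last regular [7A FE, 05, 05]] # U+1342E EGYPTIAN HIEROGLYPH AA032",
    "[first implicit [E0 04 06, 05, 05]] # CONSTRUCTED",
    "[last implicit [E4 DF 7E 20, 05, 05]] # CONSTRUCTED",
]
bounds = parse_boundaries(sample)

def in_group(primary, group):
    """True when the primary lies between the first and last primaries of the group."""
    return bounds[("first", group)][0] <= primary <= bounds[("last", group)][0]
```

For example, the primary `14` of DIGIT ONE falls inside the "regular" range, while `03 05` of CHARACTER TABULATION falls below it.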

# Top Byte => Reordering Tokens
[top_byte 00 TERMINATOR ] # [0] TERMINATOR=1
[top_byte 01 LEVEL-SEPARATOR ] # [0] LEVEL-SEPARATOR=1
[top_byte 02 FIELD-SEPARATOR ] # [0] FIELD-SEPARATOR=1
[top_byte 03 SPACE ] # [9] SPACE=1 Cc=6 Zl=1 Zp=1 Zs=1
...

The above table maps from the first bytes of the fractional weights to a reordering token. The format is "[top_byte " byte-value reordering-token "COMPRESS"? "]". The "COMPRESS" value is present when there is only one byte in the reordering token, and primary-weight compression can be applied. Most reordering tokens are script values; others are special-purpose values, such as PUNCTUATION.
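A sketch of a parser for the `top_byte` lines, following the format string given above. `parse_top_byte` is a hypothetical name, and the sketch accepts several tokens per byte as a precaution, which is an assumption beyond the samples shown.

```python
def parse_top_byte(line):
    """Return (byte value, reordering tokens, compress flag), or None if not a top_byte line."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    prefix = "[top_byte"
    if not (data.startswith(prefix) and data.endswith("]")):
        return None
    fields = data[len(prefix):-1].split()
    byte_value = int(fields[0], 16)
    tokens = fields[1:]
    # The optional trailing COMPRESS marker is a flag, not a reordering token.
    compress = bool(tokens) and tokens[-1] == "COMPRESS"
    if compress:
        tokens = tokens[:-1]
    return byte_value, tokens, compress

entry = parse_top_byte("[top_byte 03 SPACE ] # [9] SPACE=1 Cc=6 Zl=1 Zp=1 Zs=1")
```

The result for the sample line is the byte `0x03`, the single token `SPACE`, and no compression flag.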

# Reordering Tokens => Top Bytes
[reorderingTokens Arab 61=910 62=910 ]
[reorderingTokens Armi 7A=22 ]
[reorderingTokens Armn 5F=82 ]
[reorderingTokens Avst 7A=54 ]
...

The above table is an inverse mapping from reordering token to top byte(s). In terms like "61=910", the first value is the top byte, while the second is informational, indicating the number of primaries assigned with that top byte.
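The inverse table can be read in the same spirit. This sketch assumes each term has the `byte=count` shape described above; `parse_reordering_tokens` is a hypothetical name.

```python
def parse_reordering_tokens(line):
    """Return (token, {top byte: number of primaries}), or None for other lines."""
    data = line.split("#", 1)[0].strip()
    prefix = "[reorderingTokens"
    if not (data.startswith(prefix) and data.endswith("]")):
        return None
    fields = data[len(prefix):-1].split()
    token = fields[0]
    counts = {}
    for term in fields[1:]:
        top_byte, count = term.split("=")
        counts[int(top_byte, 16)] = int(count)  # the count is informational only
    return token, counts

arab = parse_reordering_tokens("[reorderingTokens Arab 61=910 62=910 ]")
```

For the Arab sample this yields two top bytes, `0x61` and `0x62`, each with 910 primaries.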

# General Categories => Top Byte
[categories Cc 03{SPACE}=6 ]
[categories Cf 77{Khmr Tale Talu Lana Cham Bali Java Mong Olck Cher Cans Ogam Runr Orkh Vaii Bamu}=2 ]
[categories Lm 0D{SYMBOL}=25 0E{SYMBOL}=22 27{Latn}=12 28{Latn}=12 29{Latn}=12 2A{Latn}=12...

The above table is informational, providing the top bytes, scripts, and primaries associated with each general category value.
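The category lines can be split into their parts with a small regular expression. This is a sketch that assumes the `byte{tokens}=count` term shape seen in the samples and reads the number after `=` as a count of primaries; `parse_categories` is a hypothetical name.

```python
import re

# One term: two hex digits, a brace-enclosed token list, then "=" and a count.
TERM_RE = re.compile(r"([0-9A-Fa-f]{2})\{([^}]*)\}=(\d+)")

def parse_categories(line):
    """Return (general category, [(top byte, tokens, count), ...]), or None."""
    data = line.split("#", 1)[0].strip()
    prefix = "[categories"
    if not data.startswith(prefix):
        return None
    fields = data[len(prefix):].split(None, 1)
    category = fields[0]
    rest = fields[1] if len(fields) > 1 else ""
    entries = [(int(byte, 16), tokens.split(), int(count))
               for byte, tokens, count in TERM_RE.findall(rest)]
    return category, entries

cc = parse_categories("[categories Cc 03{SPACE}=6 ]")
```

Since the table is informational, a parser like this is mainly useful for cross-checking the other two tables.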

# FIXED VALUES
[fixed first implicit byte E0]
[fixed last implicit byte E4]
[fixed first trail byte E5]
[fixed last trail byte EF]
[fixed first special byte F0]
[fixed last special byte FF]

The final table gives certain hard-coded byte values. The "trail" area is provided for implementation of the "trailing weights" as described in the UCA.
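As an illustration, the fixed values above can be used to classify the lead byte of a primary weight. The range names and the function `classify_lead_byte` are hypothetical; the byte values are copied from the table.

```python
# Byte ranges taken from the FIXED VALUES table above (inclusive on both ends).
FIXED_RANGES = {
    "implicit": (0xE0, 0xE4),
    "trail": (0xE5, 0xEF),
    "special": (0xF0, 0xFF),
}

def classify_lead_byte(byte_value):
    """Return the name of the fixed range holding this lead byte, or None."""
    for name, (first, last) in FIXED_RANGES.items():
        if first <= byte_value <= last:
            return name
    return None

kind = classify_lead_byte(0xE5)  # lead byte of the "first trailing" weight
```

Lead bytes below `E0` return `None` here, meaning they belong to the ordinary, non-fixed part of the weight space.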

allkeys_CLDR.txt

A reordering of the DUCET allkeys file for CLDR which guarantees that characters with weights less than the Latin letter 'a' are in the order: spaces, punctuation, general-symbols, currency-symbols, and numbers. (Here, general-symbols includes anything else in the DUCET with weights less than the Latin letter 'a', thus including some characters with the General_Category Lm.) This tailoring also sets only spaces and punctuation to be variable. Unlike allkeys.txt, the ordering is by non-ignorable sort order, and the primary values may overlap with secondaries; if non-overlap is required in the implementation, non-zero primaries should be offset by an appropriate amount. Because of the preprocessing, some values may have somewhat different weights, but the results (other than the above changes) should be the same.
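The offsetting of non-zero primaries could look like the sketch below. It assumes the collation elements have already been parsed into integer triples and takes the largest secondary value as the "appropriate amount", which is one possible choice rather than a prescribed one. The function name and the sample weights are hypothetical.

```python
def offset_primaries(elements):
    """Shift non-zero primaries above the largest secondary; zero primaries stay ignorable."""
    max_secondary = max((s for _, s, _ in elements), default=0)
    return [(p + max_secondary if p else 0, s, t) for p, s, t in elements]

# Hypothetical (primary, secondary, tertiary) values, for illustration only.
shifted = offset_primaries([(0x0000, 0x0041, 0x001F), (0x0020, 0x0020, 0x0002)])
```

After the shift, the smallest non-zero primary is larger than every secondary, so the two value ranges no longer overlap.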

The format is similar to that of allkeys.txt, although there may be some differences in whitespace.

UCA_Rules.txt

The format uses the CLDR "short" format for collation tailoring. Here is an illustration of the format:

< 𝍱 # 5.0 [No] [1499.0020.0002] U+1D371 COUNTING ROD TENS DIGIT NINE
< 0 # 1.1 [Nd] [149A.0020.0002] U+0030 DIGIT ZERO

The ASCII ZERO is primary-greater than U+1D371 COUNTING ROD TENS DIGIT NINE.

<<< ０ # 1.1 [Nd] [149A.0020.0003] U+FF10 FULLWIDTH DIGIT ZERO
The fullwidth ZERO is tertiary-after the ASCII ZERO.

<<< 🄁 / ',' # 5.2 [Nd/No] [149A.0020.0004] [011F.0020.0004] U+1F101 DIGIT ZERO COMMA / 002C
The U+1F101 DIGIT ZERO COMMA is tertiary-greater than the fullwidth ZERO followed by U+002C (comma)

= 𝟘 # 3.1 [Nd] [149A.0020.0005] U+1D7D8 MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO
U+1D7D8 MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO is primary, secondary, and tertiary equal to U+1F101 DIGIT ZERO COMMA.
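The rule lines illustrated above can be split into a relation, an item, and an optional expansion. This is a simplified sketch, not a full LDML rule parser: it assumes one rule per line and does not handle a quoted `#` or `/` inside an item. The names `parse_rule` and `STRENGTHS` are hypothetical.

```python
import re

# Relation operators seen in the samples, mapped to descriptive labels.
STRENGTHS = {"<": "primary", "<<": "secondary", "<<<": "tertiary", "=": "equal"}
RULE_RE = re.compile(r"^\s*(<<<|<<|<|=)\s*(.+?)\s*$")

def unquote(text):
    """Strip surrounding whitespace and one pair of single quotes, if present."""
    text = text.strip()
    if len(text) >= 2 and text[0] == text[-1] == "'":
        return text[1:-1]
    return text

def parse_rule(line):
    """Return (relation, item, expansion or None), or None if the line is not a rule."""
    data = line.split("#", 1)[0]  # simplification: treats every '#' as a comment start
    match = RULE_RE.match(data)
    if not match:
        return None
    operator, body = match.groups()
    item, _, expansion = body.partition("/")
    return STRENGTHS[operator], unquote(item), unquote(expansion) or None

rule = parse_rule("<<< \U0001F101 / ',' # 5.2 [Nd/No] U+1F101 DIGIT ZERO COMMA / 002C")
```

For the DIGIT ZERO COMMA sample this gives a tertiary relation, the item U+1F101, and the expansion `,`.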

For more information, see LDML.

UCA_Rules.xml

Provides the same rule set in CLDR XML format. For more information, see LDML.

UCA_Rules_NoCE.txt

Omits the (remapped) DUCET value, for more effective comparison across versions.