Revision | $Revision: 1133 $ |
Date | $Date: 2005-05-20 15:28:53 -0500 (Fri, 20 May 2005) $ |
The CLDR test files provide the ability for people to check the results of their implementations against the data provided in CLDR.
The tests are written in a consistent format, so that even if an XML parser is not available, a simple script or function can parse out the contents of each line for testing.
The DTD defining the allowable structure for each file is in cldrTest.dtd. Each file contains a number of test areas, each indicated by a particular element. Currently, test areas are provided for number, date, and collation. Each test area also uses an attribute to specify the locales it can be used for, such as:
<number locales="de de_AT de_BE de_DE de_LU">
That means that this test area can be used to test the following: German; German (Austria); German (Belgium); German (Germany); and German (Luxembourg).
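As a minimal sketch of reading that attribute, assuming the file is processed with Python's standard xml.etree.ElementTree module and that the locale IDs are whitespace-separated as in the example above (the helper name area_locales is hypothetical, not part of CLDR):

```python
import xml.etree.ElementTree as ET

def area_locales(area):
    """Return the list of locale IDs a test area applies to."""
    # The locales attribute is assumed to be a whitespace-separated list.
    return area.get("locales", "").split()

# A test area with no results, just to show the attribute handling.
area = ET.fromstring('<number locales="de de_AT de_BE de_DE de_LU"></number>')
locales = area_locales(area)
```

A test harness could then run the area only when the locale under test appears in the returned list.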
Within each test area, there are a number of result elements. Each of them contains zero or more settings (given by attributes) and a result as the content of the element. The settings are additive within a test area: a setting needs to be provided only if it differs from its value in the previous result. For example, the two rows in Table 1: Additive Settings are equivalent.
1. |
<result
dateType="none"
input="1900-01-31T00:00:00Z"
timeType="short">00:00</result>
<result
timeType="medium">00:00:00</result>
<result
timeType="long">00:00:00
GMT</result>
<result
timeType="full">00:00:00
GMT</result>
<result
dateType="short"
timeType="none">31-01-00</result>
<result
timeType="short">31-01-00
00:00</result>
|
2. |
<result
dateType="none"
input="1900-01-31T00:00:00Z"
timeType="short">00:00</result>
<result
dateType="none"
input="1900-01-31T00:00:00Z"
timeType="medium">00:00:00</result>
<result
dateType="none"
input="1900-01-31T00:00:00Z"
timeType="long">00:00:00
GMT</result>
<result
dateType="none"
input="1900-01-31T00:00:00Z"
timeType="full">00:00:00
GMT</result>
<result
dateType="short"
input="1900-01-31T00:00:00Z"
timeType="none">31-01-00</result>
<result
dateType="short"
input="1900-01-31T00:00:00Z"
timeType="short">31-01-00
00:00</result>
|
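The additive rule can be sketched as follows: keep a running dictionary of settings and update it with each result's attributes, so that row 1 of Table 1 expands into row 2. This is only an illustration under assumptions: the enclosing element name date follows the test areas named above, the locale ID xx is a placeholder, and expand_results is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# A shortened version of row 1 of Table 1; "xx" is a placeholder locale ID.
SAMPLE = """<date locales="xx">
<result dateType="none" input="1900-01-31T00:00:00Z" timeType="short">00:00</result>
<result timeType="medium">00:00:00</result>
<result dateType="short" timeType="none">31-01-00</result>
</date>"""

def expand_results(test_area):
    """Yield (settings, expected) pairs with inherited settings filled in."""
    settings = {}
    for result in test_area.iter("result"):
        # Attributes given on this result override the carried-over values;
        # anything not mentioned keeps its value from the previous result.
        settings.update(result.attrib)
        yield dict(settings), result.text

expanded = list(expand_results(ET.fromstring(SAMPLE)))
```

Because the settings reset only at the start of a test area, the running dictionary is created once per area rather than once per file.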
The date and time tests are included for locales whose exemplar characters are non-draft, while collation tests are included for all locales. Implementations that exclude additional data on the basis of its being draft will need to skip some of these tests.
For dates, the input value indicates what is to be formatted. It is represented in ISO 8601 format, but only in the main form without any variation; that is, always yyyy-MM-dd'T'HH:mm:ss'Z', and thus always at GMT (offset 00:00). The input values are chosen to span all the months of the year and important times. If the date type or time type is none, then only the other part of the date-time will be formatted.
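Since the input always has the same fixed shape, it can be read with a single fixed pattern. The sketch below assumes Python's standard datetime module; the function names parse_input and parts_to_format are hypothetical helpers for a test harness, not part of CLDR.

```python
from datetime import datetime, timezone

def parse_input(value):
    """Parse the fixed-form input value into a UTC-aware datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    # The trailing 'Z' is matched as a literal, so attach UTC explicitly.
    return parsed.replace(tzinfo=timezone.utc)

def parts_to_format(date_type, time_type):
    """Return (format_date, format_time) flags for a pair of settings."""
    # A type of "none" means that part is left out of the formatted result.
    return date_type != "none", time_type != "none"

when = parse_input("1900-01-31T00:00:00Z")
flags = parts_to_format("none", "short")
```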
For numbers, the input value indicates what is to be formatted. It is represented in standard C format, with three special values: Infinity, -Infinity, and NaN. If the application program does not format these values, then these input lines can be skipped. The input values are chosen to cover a range of possibilities; if there are other distinguished values that would be useful, let us know.
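A harness written in Python can read these inputs directly, because the built-in float() accepts the three special spellings as well as ordinary decimal notation. The sketch below is an illustration only; parse_number_input and is_special are hypothetical helper names, and is_special is one way to decide which lines to skip when an implementation does not format the special values.

```python
import math

def parse_number_input(text):
    """Convert a number test input, including the special values, to a float."""
    # float() understands "Infinity", "-Infinity" and "NaN" as well as
    # plain and exponent notation.
    return float(text)

def is_special(value):
    """True for the inputs an implementation may choose to skip."""
    return math.isinf(value) or math.isnan(value)

values = [parse_number_input(s) for s in ("-1234.5", "Infinity", "-Infinity", "NaN")]
```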
The collation test area is special. Each result element consists of a number of lines, where each line compares as greater than or equal to the line before, according to the collation for the locale. The characters used come from the exemplar characters in CLDR, plus the set of tailored characters for the collation. In addition, the characters are closed under case; that means that if 'aa' is in the exemplar set, then all the combinations of case will show up: aa, aA, Aa, AA. That will sometimes result in some oddities: for example, 'ſ' long s will show up because it is a case variant of 's'.
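The closure under case can be illustrated with a short sketch that generates every upper/lower combination of a string. It is an assumption-laden simplification: it uses only Python's str.upper() and str.lower() on each character, so it does not reproduce the full case-variant closure described above (for instance, it would not produce 'ſ' from 's'). The name case_closure is hypothetical.

```python
from itertools import product

def case_closure(text):
    """Return all combinations of upper- and lowercase for the given string."""
    # For each character, collect its distinct lower- and uppercase forms.
    choices = [sorted({ch.lower(), ch.upper()}) for ch in text]
    # Combine one choice per position in every possible way.
    return {"".join(combo) for combo in product(*choices)}

variants = case_closure("aa")
```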
Tertiary | Secondary | Primary | Identical |
---|---|---|---|
XB | Xe | Xb | xة |
Xb | xe | xb | xت |
xB | Xé | xbx | xة |
xb | xé | Xc | |
xBx | xex | xc | |
xbx | xéx | xcx | |
On top of that, the strings are prefixed by 'x' or 'X', and sometimes postfixed by 'x', as shown in Table 2: Example Patterns. This provides for a test of the strength of comparison. For example, when 'b' has a tertiary difference from 'B', the ordering will show up as in the Tertiary column example. Notice that in that column every other line has a 'b' of a different case. (In this locale, Danish, uppercase sorts first.)
With a secondary difference, we get the type of pattern shown in the Secondary column (for clarity, the uppercase variants are removed in these examples). Notice that rather than being every other line, as with the 'b's, the accented letters, having a secondary difference, are clumped together (except where a following 'x' separates them).
With a primary difference, of course, the letters are completely separated (again, removing the uppercase characters for clarity). There is a break between the line with the trailing x, and a following line without one. At the other end of the strength spectrum, if two strings are identical according to the collation (at the normal strength setting), then this is represented by repetition. For example, for Arabic we have what is in the Identical column. This will guarantee the right ordering, since a testing program will have xة ≤ xت ≤ xة, which can only be true if xة = xت.
The data is presented simply as successive lines rather than multiple elements, so that multiple lines can simply be copied and used in other circumstances, without having to remove the element structure. (In addition, it makes the files much smaller.)
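Because the lines are plain text, checking a collation result reduces to comparing each line with the one before it. The following sketch assumes the implementation under test exposes a comparison function returning a negative, zero, or positive number; the parameter compare stands in for it, and both comparators shown are toy stand-ins for demonstration, not real collators. Empty lines are dropped, and other whitespace is left untouched, which is also an assumption about the file layout.

```python
def find_order_violations(result_text, compare):
    """Return adjacent line pairs where the earlier line sorts after the later."""
    lines = [line for line in result_text.splitlines() if line]
    # Each line must compare as greater than or equal to the line before it.
    return [(prev, cur) for prev, cur in zip(lines, lines[1:])
            if compare(prev, cur) > 0]

def codepoint_compare(a, b):
    """Toy comparator: raw code point order."""
    return (a > b) - (a < b)

def toy_caseless_compare(a, b):
    """Toy comparator: case-insensitive first, uppercase first on ties."""
    key_a, key_b = (a.lower(), a), (b.lower(), b)
    return (key_a > key_b) - (key_a < key_b)

# The Primary column of Table 2, one string per line.
SAMPLE = "Xb\nxb\nxbx\nXc\nxc\nxcx"

codepoint_failures = find_order_violations(SAMPLE, codepoint_compare)
caseless_failures = find_order_violations(SAMPLE, toy_caseless_compare)
```

With a repeated string such as the Identical column, the same pairwise check enforces equality: both a ≤ b and b ≤ a must hold for the run to pass.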