Local Collation Sequences

A collation sequence determines the sorting order of any given character set. For example, in English, the order of the alphabet is most commonly used to order a list of English words. That is, words beginning with the letter c appear before words beginning with the letter d, which appear before words beginning with the letter e and so on.

A collation sequence is associated with each database when the database is created. This sequence determines in what order sorted data is returned to users and applications, what is returned when queries use pattern matching and, in some instances, how data is stored internally.

Supported Collation Sequences

In a computer, if no other collation sequence is enforced, the sequence derived from the machine's native character set, either ASCII or EBCDIC, is used. The sorting order of these character sets derives from the internal numeric representation of each character.

In addition to the sequences derived from ASCII and EBCDIC, Ingres supports two other local collation sequences. They are:

Multi, a sequence derived from the DEC Multinational Character Set
Spanish, a sequence derived from the Spanish language

If none of these sequences adequately fills your needs, you can write your own local collation sequence.

Multi Collation Sequence

The multi collation sequence is based on the DEC Multinational Character Set. This character set adds several vowels with diacritical marks to the standard 7-bit ASCII character set.

Following are the comparison sequences for the multi sequence that differ from those of ASCII:

A < À < Á < Â < Ã < Ä < B

C < Ç < D < E < È < É < Ê < Ë

I < Ì < Í < Î < Ï < J

N < Ñ < O

O < Ò < Ó < Ô < Õ < Ö < Œ < P

U < Ù < Ú < Û < Ü < V

Y < Ÿ < Z < Æ < Ø < Å

a < à < á < â < ã < ä < b

c < ç < d < e < è < é < ê < ë

i < ì < í < î < ï < j

n < ñ < o

o < ò < ó < ô < õ < ö < œ < p

ss < ß < st

u < ù < ú < û < ü < v

y < ÿ < z < æ < ø < å

For example:

cote < côte < czar < cæsar
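
The ordering in this example can be reproduced with a short Python sketch. It is an illustration only, not Ingres code: it covers lowercase letters, leaves out ß and œ (which need multi-character handling), and the names GROUPS, RANK, and multi_key are hypothetical.

```python
import string

# Accented forms that sort immediately after their base letter (subset of the multi sequence).
GROUPS = {
    "a": "àáâãä",
    "c": "ç",
    "e": "èéêë",
    "i": "ìíîï",
    "n": "ñ",
    "o": "òóôõö",
    "u": "ùúûü",
    "y": "ÿ",
}

def build_rank():
    """Return a dict mapping each character to its position in the sort order."""
    order = []
    for base in string.ascii_lowercase:
        order.append(base)
        order.extend(GROUPS.get(base, ""))
    order.extend("æøå")  # these three sort after z
    return {ch: rank for rank, ch in enumerate(order)}

RANK = build_rank()

def multi_key(word):
    """Sort key: the list of per-character ranks."""
    return [RANK[ch] for ch in word]

words = ["czar", "côte", "cæsar", "cote"]
sorted_words = sorted(words, key=multi_key)
print(sorted_words)  # ['cote', 'côte', 'czar', 'cæsar']
```

Comparing lists of ranks instead of raw code points is what places ô directly after o and æ after z.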

Pattern matching rules:

Œ, œ, and ß are treated specially in pattern matching:
Œ matches O_ or O% as well as Œ
œ matches o_ or o% as well as œ
ß matches s_ or s% as well as ß

Spanish Collation Sequence

The Spanish collation sequence is based on the multi sequence but contains additional support for the Spanish letters ll and ch. Listed below are the comparison sequences for the Spanish collation sequence that differ from those of ASCII. Some pattern matching rules are also described.

A < À < Á < Â < Ã < Ä < B

CZ < CÅ < Ç < CH < Ch < D < E < È < É < Ê < Ë

I < Ì < Í < Î < Ï < J

LZ < LÅ < LL < Ll < M

N < Ñ < O

O < Ò < Ó < Ô < Õ < Ö < Œ < P

U < Ù < Ú < Û < Ü < V

Y < Ÿ < Z < Æ < Ø < Å

a < à < á < â < ã < ä < b

cz < cå < ç < cH < ch < d < e < è < é < ê < ë

i < ì < í < î < ï < j

lz < lå < lL < ll < m

n < ñ < o

o < ò < ó < ô < õ < ö < œ < p

ss < ß < st

u < ù < ú < û < ü < v

y < ÿ < z < æ < ø < å

Examples:

loop < llama

cote < côte < czar < cæsar < chair
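
A minimal Python sketch of these examples follows. It assumes lowercase input and treats ch and ll as single collating elements placed after c (and ç) and after l respectively. The names ACCENTED, DIGRAPHS, and spanish_key are hypothetical, and the sketch ignores ß, œ, and the mixed-case forms listed above.

```python
import string

ACCENTED = {"a": "àáâãä", "c": "ç", "e": "èéêë", "i": "ìíîï",
            "n": "ñ", "o": "òóôõö", "u": "ùúûü", "y": "ÿ"}
DIGRAPHS = {"c": "ch", "l": "ll"}       # digraph sorts after its base letter group
DIGRAPH_SET = set(DIGRAPHS.values())

def build_rank():
    """Rank order: base letter, its accented forms, then its digraph (if any)."""
    elements = []
    for base in string.ascii_lowercase:
        elements.append(base)
        elements.extend(ACCENTED.get(base, ""))
        if base in DIGRAPHS:
            elements.append(DIGRAPHS[base])
    elements.extend("æøå")
    return {el: rank for rank, el in enumerate(elements)}

RANK = build_rank()

def spanish_key(word):
    """Split the word into collating elements and return their ranks."""
    key, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPH_SET:          # "ch" or "ll" counts as one element
            key.append(RANK[pair])
            i += 2
        else:
            key.append(RANK[word[i]])
            i += 1
    return key

words = ["chair", "llama", "cote", "loop", "cæsar", "czar", "côte"]
sorted_words = sorted(words, key=spanish_key)
print(sorted_words)
# ['cote', 'côte', 'czar', 'cæsar', 'chair', 'loop', 'llama']
```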

The pattern matching rules are:

Œ, œ, ß, Ç, and ç are treated specially in pattern matching:
Œ matches O_ or O% as well as Œ
œ matches o_ or o% as well as œ
ß matches s_ or s% as well as ß
Ç matches C_ or C% as well as Ç
ç matches c_ or c% as well as ç

Custom Collation Sequence

If you have special needs that are not met by the available collation sequences, you can write your own. Ingres allows you to write a collation sequence that has any of the following characteristics:

Character skipping—one or more specified characters are ignored for collation
One-to-one mapping—a character can be substituted for another or weighted differently for collation
Many-to-one mapping—groups of characters can be substituted for a single character or weight value for collation
Many-to-many mapping—groups of characters can be substituted for a sequence of characters or weight values for collation
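
The four characteristics above can be illustrated with one small Python sketch. It is a conceptual model only and does not reflect how Ingres implements collation: the RULES table, its weights, and the collation_key function are hypothetical, and unlisted characters simply fall back to their code point.

```python
# Hypothetical rule table: source string -> tuple of weights used for comparison.
RULES = {
    "-": (),                                 # character skipping: hyphen is ignored
    "é": (ord("e"),),                        # one-to-one: é is weighted as e
    "ch": (ord("c") + 0.5,),                 # many-to-one: ch is one element between c and d
    "mc": (ord("m"), ord("a"), ord("c")),    # many-to-many: mc collates as mac
}
MAX_LEN = max(len(src) for src in RULES)

def collation_key(text):
    """Build a comparison key by applying the longest matching rule at each position."""
    key, i = [], 0
    while i < len(text):
        for size in range(MAX_LEN, 0, -1):   # try the longest match first
            chunk = text[i:i + size]
            if len(chunk) == size and chunk in RULES:
                key.extend(RULES[chunk])
                i += size
                break
        else:
            key.append(ord(text[i]))         # no rule: use the native code point
            i += 1
    return tuple(key)

print(collation_key("co-op") == collation_key("coop"))        # True
print(collation_key("cz") < collation_key("ch") < collation_key("d"))  # True
print(collation_key("mcadam") == collation_key("macadam"))    # True
```

Note that the skipping and one-to-one rules in this table make different strings compare as equal.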

Guidelines for Creating a Custom Collation File

Keep the following points in mind as you design and test your custom collation file:

Never create a production database with an untested collation sequence. Always test your collation file on a sample database. Each time you modify the collation sequence to fix a problem, you must unload the database, destroy the old database, install the new sequence, create a database with the new sequence, and reload the database.
Some collation sequences allow two strings that are different to compare as equal. These sequences are called information loss sequences. An example of this type of sequence is a sequence that ignores case.
Problems that can result from such a sequence are:
- If duplicates are not allowed, the DBMS drops all but one string.
- If duplicates are not allowed, the DBMS does not allow you to add a row to a table if it appears to match an existing row.
- The hash storage structure cannot detect when two equal but different strings are placed in a hashed relation.
- In a query on a hash table, the "=" operator can only fetch one of the 'equal' strings that matches the expression.
  Because of these problems, we suggest that you do not use information loss sequences and the hash storage structure together.
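
The first of these problems can be sketched in Python with a case-ignoring key. This is an analogy rather than a model of the DBMS: ignore_case_key and drop_duplicates are hypothetical helpers, and a plain dictionary stands in for duplicate elimination.

```python
def ignore_case_key(value):
    """An information loss key: strings that differ only in case compare as equal."""
    return value.casefold()

def drop_duplicates(rows, key):
    """Keep the first row seen for each key, as duplicate elimination would."""
    unique = {}
    for row in rows:
        unique.setdefault(key(row), row)
    return list(unique.values())

rows = ["Smith", "SMITH", "smith", "Jones"]
kept = drop_duplicates(rows, ignore_case_key)
print(kept)  # ['Smith', 'Jones']
```

Three distinct strings collapse to one, and which one survives depends only on the order in which the rows arrive.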

How You Write a Customized Collation Sequence

To create a customized collation sequence, follow these steps:

Write a description file.
Run the description file through the aducompile utility.
Test your collation sequence with a small sample database.

Description File—Describe Collation Sequence

To define a custom collation sequence you must create a description file, which consists of a list of "instructions" that, taken as a whole, describe the collation sequence. Each instruction must appear on a separate line in the file.

The format of each instruction is:

value:string

where:

value
Determines the numerical weight assigned to string. (The internal numerical weight of each character determines where a character appears in the sort order.)

The value can have any of the following formats:
- char+number
  Sorts the specified string after the specified character and before the next higher-weighted character in the character set. For example, in the following instruction, string1 is mapped as a single character that is ordered immediately after the letter H and before I in a sorted sequence:
  
  H+1:string1
  
  In the following instruction, string2 sorts after string1 and before the letter I.
  
  H+2:string2
  
  You can specify H+1:string or Hz+1:string and both sort in the same manner, that is, after H and before I. However, the two forms do not behave the same way when pattern matching is applied. To illustrate with an example from the Spanish language, the following instruction maps CH as a single character that sorts between C and D:
  
  C+1:CH
  
  If you ask for a pattern match using the format C%, instances of CH are not returned. The alternative, Cz+1:CH, maps CH into two characters, C and a virtual character just after z. This causes CH to match as two characters. A pattern match using the format C% finds the instances of CH.
- charstring
  Sorts the specified string as the equivalent of the specified charstring. For example, in the following instruction, the word tax sorts as if it were the word revenue:
  
  revenue:tax
- +number
  Gives the specified string the internal numerical weight specified by the given number. The number must be between 0 and 32766. The weighting of a character in this manner is less portable than giving the character a relative weight.
- +*
  Causes the specified string to be ignored when collation is performed. For example, in the following instruction, the "?" is ignored whenever collation takes place.
  
  +*:?
- (empty)
  When no value is specified (the instruction takes the form :string), the collation compiler ignores the instruction. Use this format to insert comments into your collation sequence. For example:
  
  :This is a comment
string
Is any character or character string. An empty string causes a syntax error.
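
The instruction formats described above can be summarized with a small Python classifier. This is a rough sketch for checking a description file by eye, not the grammar that aducompile actually applies: the classify function and its category names are hypothetical, it assumes the value ends at the first colon, and it treats the 0 to 32766 range as inclusive.

```python
import re

def classify(line):
    """Split one instruction into (kind, parsed value, string)."""
    value, sep, string = line.partition(":")
    if not sep:
        raise ValueError("instruction has no ':' separator")
    if value == "":
        return ("comment", None, string)          # :text is ignored
    if string == "":
        raise ValueError("empty string is a syntax error")
    if value == "+*":
        return ("ignore", None, string)           # string is skipped in collation
    match = re.fullmatch(r"\+(\d+)", value)
    if match:
        weight = int(match.group(1))
        if not 0 <= weight <= 32766:
            raise ValueError("weight out of range")
        return ("absolute", weight, string)       # +number
    match = re.fullmatch(r"(.+)\+(\d+)", value)
    if match:
        return ("relative", (match.group(1), int(match.group(2))), string)  # char+number
    return ("equivalent", value, string)          # charstring

SAMPLE = [
    ":This is a comment",
    "H+1:string1",
    "Cz+1:CH",
    "revenue:tax",
    "+*:?",
]

for instruction in SAMPLE:
    print(classify(instruction))
```

The sample lines are the instructions used as examples in this section.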

The aducompile Utility

The aducompile utility compiles your description file into a binary file and installs that file as a collation sequence that can be used. You must be the installation owner to use this utility. Be sure to give your resulting collation file a unique name so that you do not overwrite any existing collation files.

Your new collation sequence is located at $II_SYSTEM/ingres/files/collation/collation_name.

Note: In UNIX, all system users must have rights to read the new collation file.
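
One way to check the read permission on a UNIX-like system is sketched below in Python. The helper names are hypothetical, the path you pass in is whatever file aducompile produced in your installation, and the sketch looks only at the file's own mode bits, not at the directories above it.

```python
import os
import stat

def is_world_readable(path):
    """Return True if the 'other' read bit is set on the file."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)

def make_world_readable(path):
    """Add read permission for owner, group, and others, keeping the other bits."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
```

Changing the mode requires that you own the file or have sufficient privileges.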