Ice9web.com

 

GLOBALIZATION & LOCALIZATION

Database tables are used to store and retrieve data. Different languages and character sets need to be stored and retrieved differently. As such, MySQL needs to accommodate different character sets (different alphabets and characters) as well as different ways to sort and retrieve data.
When discussing multiple languages and characters sets, you will run into the following important terms:

  • Character sets are collections of letters and symbols.
  • Encodings are the internal representations of the members of a character set.
  • Collations are the instructions that dictate how characters are to be compared.

 

Understanding Character Sets and Collation Sequences

Database tables are used to store and retrieve data. Different languages and character sets need to be stored and retrieved differently. As such, MySQL needs to accommodate different character sets (different alphabets and characters) as well as different ways to sort and retrieve data.

When discussing multiple languages and characters sets, you will run into the following important terms:

  • Character sets are collections of letters and symbols.
  • Encodings are the internal representations of the members of a character set.
  • Collations are the instructions that dictate how characters are to be compared.

 

Note

Why Collations Are Important Sorting text in English is easy, right? Well, maybe not. Consider the words APE, apex, and Apple. Are they in the correct sorted order? That would depend on whether you wanted a case-sensitive or a not case-sensitive sorting.

The words would be sorted one way using a case-sensitive collation, and another way using a not case-sensitive collation. And this affects more than just sorting (as in data sorted using ORDER BY); it also affects searches (whether or not a WHERE clause looking for apple finds APPLE, for example). The situation gets even more complex when characters such as the French à or German ö are used, and even more complex when non-Latin-based character sets are used (Japanese, Hebrew, Russian, and so on).

In MySQL there is not much to worry about during regular database activity (SELECT, INSERT, and so forth). Rather, the decision as to which character set and collation to use occurs at the server, database, and table level.

 

Working with Character Set and Collation Sequences

MySQL supports a vast number of character sets. To see the complete list of supported character sets, use this statement:

• Input

SHOW CHARACTER SET;

Working with Character Set and Collation Sequences

• Analysis

This statement displays all available character sets, along with the description and default collation for each.

To see the complete list of supported collations, use this statement:

• Input

SHOW COLLATION;

Working with Character Set and Collation Sequences

• Analysis

This statement displays all available collations, along with the character sets to which they apply. You will notice that several character sets have more than one collation. latin1, for example, has several for different European languages, and many appear twice, once case sensitive (designated by _cs) and once not case sensitive (designated by _ci).

A default character set and collation are defined (usually by the system administration at installation time). In addition, when databases are created, default character sets and collations may be specified, too. To determine the character sets and collations in use, use these statements:

• Input

SHOW VARIABLES LIKE 'character%';
SHOW VARIABLES LIKE 'collation%';

Working with Character Set and Collation Sequences

In practice, character sets can seldom be server-wide (or even database-wide) settings. Different tables, and even different columns, may require different character sets, and so both may be specified when a table is created.

To specify a character set and collation for a table, CREATE TABLE (seen in Tutorial 26, "Creating and Manipulating Tables") is used with additional clauses:

• Input

CREATE TABLE mytable
(
   columnn1   INT,
   columnn2   VARCHAR(10)
) DEFAULT CHARACTER SET hebrew
  COLLATE hebrew_general_ci;

 

• Analysis

This statement creates a two column table, and specifies both a character set and a collate sequence.

In this example both CHARACTER SET and COLLATE were specified, but if only one (or neither) is specified, this is how MySQL determines what to use:

  • If both CHARACTER SET and COLLATE are specified, those values are used.
  • If only CHARACTER SET is specified, it is used along with the default collation for that character set (as specified in the SHOW CHARACTER SET results).
  • If neither CHARACTER SET nor COLLATE are specified, the database default is used.

In addition to being able to specify character set and collation table wide, MySQL also allows these to be set per column, as seen here:

• Input

CREATE TABLE mytable
(
   columnn1   INT,
   columnn2   VARCHAR(10),
   column3    VARCHAR(10) CHARACTER SET latin1 COLLATE latin1_general_ci
) DEFAULT CHARACTER SET hebrew
  COLLATE hebrew_general_ci;


• Analysis

Here CHARACTER SET and COLLATE are specified for the entire table as well as for a specific column.

As mentioned previously, the collation plays a key role in sorting data that is retrieved with an ORDER BY clause. If you need to sort specific SELECT statements using a collation sequence other than the one used at table creation time, you may do so in the SELECT statement itself:

• Input

SELECT * FROM customers
ORDER BY lastname, firstname COLLATE latin1_general_cs;

 

• Analysis

This SELECT uses COLLATE to specify an alternate collation sequence (in this example, a case-sensitive one). This will obviously affect the order in which results are sorted.

 

Tip

Occasional Case Sensitivity The SELECT statement just seen demonstrates a useful technique for performing case-sensitive searches on a table that is usually not case sensitive. And of course, the reverse works just as well.
 

Note

Other SELECT COLLATE Clauses In addition to being used in ORDER BY clauses, as seen here, COLLATE can be used with GROUP BY, HAVING, aggregate functions, aliases, and more.

One final point worth noting is that strings may be converted between character sets if absolutely needed. To do this, use the Cast() or Convert() functions.

 

Summary

In this tutorial, you learned the basics of character sets and collations. You also learned how to define the character sets and collations for specific tables and columns, and how to use alternate collations when needed. 

[Next]