Ordering and comparing text in Rails and MySQL
Ordering and comparing text is a lot trickier than most people expect. Computer scientists even came up with a complicated name for it: “collation”.
There are two groups of problems associated with collation: cultural and technical. Today we’re not going to focus on technical problems, but rather how the cultural problems influence the technical solution.
An example of a cultural difference is letter ordering in a language, in Swedish the Ä is ordered after the z and in German it follows the letter a. You can also discuss whether the ä is an alternative form of the a or if it’s a completely different character, this is relevant when implementing search. When searching for ‘nächste’ you might also be interested in text containing ‘nachste’.
MySQL implements a number of collations solutions so you can use the one relevant for your application. For instance utf8_icelandic_ci
when you want to order UTF-8 encoded text based on the cultural norm in Iceland. You can get a full list of supported collations from the mysql client by running a SHOW COLLATION
query.
When you have an international site that might contain multiple languages you can use the default Unicode Collation Algorithm, which is called utf8_unicode_ci
in MySQL. This UCA is pretty sensible so a lot of frameworks, including Rails, use it as a default. Unfortunately this collation also changes character equality. Lets look at an example how this might go wrong.
mysql> CREATE DATABASE books_example CHARACTER SET utf8
COLLATE utf8_unicode_ci;
mysql> USE books_example;
mysql> CREATE TABLE books ( title VARCHAR(255) );
mysql> SHOW FIELDS FROM books;
+-------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| title | varchar(255) | YES | | NULL | |
+-------+--------------+------+-----+---------+-------+
1 row in set (0.00 sec)
mysql> INSERT INTO books SET title = 'Pokemon';
mysql> INSERT INTO books SET title = 'pokemon';
Now we have a database with books, currently with two entries:
mysql> SELECT * FROM books;
+---------+
| title |
+---------+
| Pokemon |
| pokemon |
+---------+
2 rows in set (0.00 sec)
A sensible action might be to select a book by name:
mysql> SELECT * FROM books WHERE title = 'pokemon';
+---------+
| title |
+---------+
| Pokemon |
| pokemon |
+---------+
2 rows in set (0.00 sec)
But unfortunately this returns both books because in utf8_unicode_ci
P equals p, in the same way a equals ä equals A. This is useful for ordering and searching but not for selecting.
We can fix it by specifying a binary collation algorithm for the select so it will not use these fuzzy equality rules:
mysql> SELECT * FROM books WHERE title = 'pokemon' COLLATE utf8_bin;
+---------+
| title |
+---------+
| pokemon |
+---------+
1 row in set (0.00 sec)
Note that this problem can introduce a lot of bugs and maybe even security problems. For instance, imagine two accounts: ‘Manfred’ and ‘manfred’. With the following query it is undetermined which of these two will be returned:
SELECT * FROM accounts WHERE username = 'manfred' LIMIT 1
My advise is to set the default collation for your database to utf8_bin
and include the collation in queries where you want to order the entries nicely for the user interface or when you need fuzzy equality for searching..
development:
database: books_development
adapter: mysql
encoding: utf8
collation: utf8_bin
In Rails you can specify the collation in database.yml
.