Collation
This feature is in Public Preview.
Applies to: Databricks SQL
Databricks Runtime 16.1 and later
A collation is a set of rules that determines how string comparisons are performed. Collations are used to compare strings in a case-insensitive, accent-insensitive, or trailing space insensitive manner, or to sort strings in a specific language-aware order.
Strings in Databricks are represented as UTF-8 encoded Unicode characters.
By default Databricks compares strings by their binary UTF8 representation. This is known as UTF8_BINARY
collation.
UTF8_BINARY
comparisons are fast and appropriate in many cases, but may not be suitable for all applications, especially those that require language-aware sorting or comparisons.
Aside from language-aware comparisons, a common use case is enforcing case-insensitivity.
Databricks has the UTF8_LCASE
collation specifically for this purpose.
It converts strings to lowercase before comparing them using the fast UTF8_BINARY
collation.
For language-aware comparisons, Databricks employs the following technologies:
- International Components for Unicode (ICU) library to compute collation
- Common Locale Data Repository (CLDR) tables for specific locale-aware collation.
- Unicode Locale Data Markup Language (LDML) to encode collations internally.
These technologies are encapsulated in a set of named collations that can be used in SQL statements.
Collation names
Because identifying collations by their LDML specification can be complex and challenging to read, Databricks has a set of easier-to-use named system collations.
Syntax
{ UTF8_BINARY |
UTF8_LCASE |
{ UNICODE | locale } [ _ modifier [...] ] }
locale
language_code [ _ script_code ] [ _ country_code ]
modifier
{ CS | CI | AS | AI | RTRIM }
-
UTF8_BINARY
A meta-locale binary collation that compares strings byte by byte based on the UTF-8 byte representation.
UTF8_BINARY
is the default and most lightweight collation for string comparison in Databricks.In this collation ‘A’ (x’65’) < ‘B’ (x’66’) < … < ‘Z’ (x’90’).
However, ‘Z’ (x’90’) < ‘a’ (x’97’), and ‘A’ (x’65’) <> ‘a’ (x’97’).
Further, characters such as ‘Ä’ (x’C384’) are greater than ‘Z’ and ‘z’ in this collation.
-
UTF8_LCASE
A lightweight meta-locale case-insensitive collation that compares strings using their UTF-8 byte representation after converting the strings to lower case.
UTF8_LCASE
is the collation used for identifiers in Databricks.For example:
SQLORDER BY col COLLATE UTF8_LCASE
is equivalent to
SQLORDER BY LOWER(col) COLLATE UTF8_BINARY
-
UNICODE
The ICU root locale.
This collation, known in CLDR as the ‘root’ locale (LDML specification: ‘und-u’) imposes a language agnostic order, which tries to be intuitive overall. In this collation, like characters are grouped. For example: ‘a’ < ‘A’ < ‘Ä’ < ‘b’. ‘A’ is not considered equivalent to ‘a’. Therefore, the collation is case-sensitive. ‘a’ is not considered equivalent to ‘ä’. Therefore, the collation is accent-sensitive.
-
locale
A locale-aware collation based on the CLDR tables.
The locale is specified as a language code, an optional script code, and an optional country code.
locale
is case-insensitive.- language_code: A two-letter ISO 639-1 language code.
- script_code: A four-letter ISO 15924 script code.
- country_code: A three-letter ISO 3166-1 alpha-3 country code.
-
modifier
Specifies the collation behavior regarding case sensitivity and accent sensitivity.
- CS: Case-sensitive. The default behavior.
- CI: Case-insensitive.
- AS: Accent-sensitive. The default behavior.
- AI: Accent-insensitive.
Applies to:
Databricks Runtime 16.2 and later
- RTRIM: Trailing space insensitive. Trims trailing spaces (‘u0020’) before comparison.
Applies to:
Databricks Runtime 16.2 and later
You can specify
RTRIM
, eitherCS
orCI
, and eitherAS
orAI
at most once and in any order. The modifiers themselves are case-insensitive.
When processing a collation, Databricks normalizes collation names by removing defaults.
For example, SR_CYR_SRN_CS_AS
is normalized to SR
.
For a list of supported collations, see Supported collations.
Examples
-- You can fully qualify collations, and case doesn't matter.
system.builtin.unicode
-- Since all collations are system defined you don't need to qualify them
unicode
-- Using 2-letter language code only for german collation
DE
-- Using 2-letter language code and 3-letter country code for french-canadian collation
-- It is common to use lower case 2-letter language codes and upper case 3-letter country codes
-- But collation names are case insensitive
fr_CAN
-- Using 2-letter language code and 4-letter script code and 3-letter country code for traditional chinese in Macao
zh_Hant_MAC
-- Using a 2 letter german language code and 2 modifiers for case insensitive and accent insensitive
-- So 'Ä', 'A', and 'a' are all considered equal
de_CI_AI
-- Using back ticks is allowed, but unnecessary for builtin collations
`UTF8_BINARY`
Default collation
The default collation applies when using STRING
literals, parameter markers, functions without STRING
parameters producing strings, and when defining column, field or variable types without a COLLATE clause.
The default collation is derived in UTF8_BINARY
.
Collation precedence
To decide which collation to use for a given string Databricks defines collation precedence rules.
The rules assign 4 levels of precedence to collations:
-
Explicit
The collation has been explicitly assigned to a string using COLLATE expression.
Examples
SQL-- Force fast binary collation to check whether a vin matches a Ferrari
vin COLLATE UTF8_BINARY LIKE 'ZFF%'
-- Force German collation to order German first names
ORDER BY vorname COLLATE DE -
Implicit
The collation is implicitly assigned by the column, field, column-alias, variable, or routine parameter reference. This includes the result of a subquery as long as the collation is not None.
Examples
SQL-- Use the collation of the column as it was defined
employee.name LIKE 'Mc%'
-- Use the collation of the variable as it was defined.
translate(session.tempvar, 'Z', ',') -
Default
A
STRING
literal, named or unnamed parameter marker, or aSTRING
produced by a function from another type.Examples
SQL-- A literal string has the default collation
'Hello'
-- :parm1 is a parameter marker using session default collation
EXECUTE IMMEDIATE 'SELECT :parm1' USING 'Hello' AS parm1;
-- ? is a parameter marker using session default collation
EXECUTE IMMEDIATE 'SELECT ?' USING 'Hello';
-- The result of a cast of a non-STRING to a STRING is a STRING with the default collation
CAST(5 AS STRING)
-- The date is converted to a string using the default collation
to_char(DATE'2016-04-08', 'y')
-- The collation of the session_user STRING is the default collation
session_user()The assigned collation is the Default Collation.
-
None
A
STRING
result of a function, operator or set operation (e.g.UNION
) that takes more than oneSTRING
argument which have different implicit collations.Examples
SQL-- Concatenating two strings with different explicit collations results in no collation
SELECT fr || de AS freutsch FROM VALUES('Voulez vous ' COLLATE FR), 'Kartoffelsupp...' COLLATE DE) AS T(fr, de)
-- A union of two strings with different excplicit collations results in no collation
SELECT 'Voulez vous ' COLLATE FR UNION ALL SELECT 'Kartoffelsupp...' COLLATE DE
Collation derivation
When deriving the collation for a STRING
result, the collation precedence rules are applied in the following ways:
If the expression:
-
matches the definitions above
The collation and precedence is as defined.
-
is a function or operator with a single
STRING
parameter, returning aSTRING
The collation and precedence is that of the
STRING
parameter. -
is a function or operator with two or more
STRING
parameters-
with the same collations and precedence
The collation and precedence is that of the
STRING
parameters. -
with different collations or precedence
Let
C1
andC2
be distinct collations and letD
be the default collation. The precedence and the collation is determined by the following table:Collation and Precedence C1 Explicit C1 Implicit D Default None C2 Explicit Error C2 Explicit C2 Explicit C2 Explicit C2 Implicit Explicit C1 None C2 Implicit None D Default C1 Explicit C1 Implicit D Default None None C1 Explicit None None None
-
Examples
> SELECT 'hello' = 'hello ' COLLATE UNICODE_RTRIM;
true
> CREATE TABLE words(fr STRING COLLATE FR, de STRING COLLATE DE, en STRING COLLATE EN);
> INSERT INTO words VALUES ('Salut', 'Hallo', 'Hello');
-- A literal string has the default collation
> SELECT collation('Ciao');
UTF8_BINARY
-- A function producing a STRING has the default collation
> SELECT collation(user());
UTF8_BINARY
-- Function modifying a STRING passes the collation through
> SELECT collation(upper('Ciao'));
UTF8_BINARY
-- Implicit collation (French) wins over default collation
> SELECT collation(fr || 'Ciao') FROM words;
FR
-- Explicit collation (French) wins over implicit collation (German)
> SELECT collation('Salut' COLLATE FR || de) FROM words;
FR
-- Implicit collation German collides with implicit collation French
-- The result is no collation
> SELECT collation(de || fr) FROM words;
null
-- Explicit collation (French) wins over default collation (Italian)
> SELECT collation('Salut' COLLATE FR || 'Ciao');
FR
-- Explicit collation (French) collides with explicit collation (German)
> SELECT collation('Salut' COLLATE FR || 'Hallo' COLLATE DE);
COLLATION_MISMATCH.EXPLICIT
-- The explicit collation wins over no collation
> SELECT collation('Ciao' COLLATE IT || (fr || de)) FROM words;
IT
-- The implict collation (English) does not win over None
> SELECT collation(en || (fr || de)) FROM words;
null
-- The explicit collation (English) wins over Implicit collation anywhere in the expression
> SELECT collation((fr || ltrim('H' COLLATE EN, fr)) || fr) FROM words;
EN