Storing and processing Unicode characters in Oracle

Posted on Updated on

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems (Wikipedia). The standard is maintained by the Unicode Consortium, and contains over 137,994 characters (137,766 graphic characters, 163 format characters and 65 control characters).

The NVARCHAR2 is Unicode data type that can store Unicode characters in an Oracle Database. The character set of the NVARCHAR2 is national character set specified at the database creation time. Use the following to determine the national character set for your database.

SELECT *
FROM nls_database_parameters
WHERE PARAMETER = 'NLS_NCHAR_CHARACTERSET';

For my database I’m using an Oracle Autonomous Database. This query returns the character set AL16UTF16. This character set encodes Unicode data in the UTF-16 encoding and uses 2 bytes to store a character.

When creating an attribute with this data type, the size value (max 4000) determines the number of characters allowed. The actual size of the attribute will be double.

Let’s setup some data to test this data type.

CREATE TABLE demo_nvarchar2 (
   attribute_name NVARCHAR2(100));

INSERT INTO demo_nvarchar2 
VALUES ('This is a test for nvarchar2');

The string is 28 characters long. We can use the DUMP function to see the details of what is actually stored.

SELECT attribute_name, DUMP(attribute_name,1016)
FROM demo_nvarchar2;

The DUMP function returns a VARCHAR2 value that contains the datatype code, the length in bytes, and the internal representation of a value.

 

You can see the difference in the storage size of the NVARCHAR2 and the VARCHAR2 attributes.

Valid values for the return_format are 8, 10, 16, 17, 1008, 1010, 1016 and 1017. These values are assigned the following meanings:


8 – octal notation
10 – decimal notation
16 – hexadecimal notation
17 – single characters
1008 – octal notation with the character set name
1010 – decimal notation with the character set name
1016 – hexadecimal notation with the character set name
1017 – single characters with the character set name

The returned value from the DUMP function gives the internal data type representation. The following table lists the various codes and their description.

Code Data Type
1 VARCHAR2(size [BYTE | CHAR])
1 NVARCHAR2(size)
2 NUMBER[(precision [, scale]])
8 LONG
12 DATE
21 BINARY_FLOAT
22 BINARY_DOUBLE
23 RAW(size)
24 LONG RAW
69 ROWID
96 CHAR [(size [BYTE | CHAR])]
96 NCHAR[(size)]
112 CLOB
112 NCLOB
113 BLOB
114 BFILE
180 TIMESTAMP [(fractional_seconds)]
181 TIMESTAMP [(fractional_seconds)] WITH TIME ZONE
182 INTERVAL YEAR [(year_precision)] TO MONTH
183 INTERVAL DAY [(day_precision)] TO SECOND[(fractional_seconds)]
208 UROWID [(size)]
231 TIMESTAMP [(fractional_seconds)] WITH LOCAL TIMEZONE