Storing and processing Unicode characters in Oracle

Posted on February 17, 2020 Updated on February 13, 2020

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems (Wikipedia). The standard is maintained by the Unicode Consortium, and contains over 137,994 characters (137,766 graphic characters, 163 format characters and 65 control characters).

The NVARCHAR2 is Unicode data type that can store Unicode characters in an Oracle Database. The character set of the NVARCHAR2 is national character set specified at the database creation time. Use the following to determine the national character set for your database.

SELECT *
FROM nls_database_parameters
WHERE PARAMETER = 'NLS_NCHAR_CHARACTERSET';

For my database I’m using an Oracle Autonomous Database. This query returns the character set AL16UTF16. This character set encodes Unicode data in the UTF-16 encoding and uses 2 bytes to store a character.

When creating an attribute with this data type, the size value (max 4000) determines the number of characters allowed. The actual size of the attribute will be double.

Let’s setup some data to test this data type.

CREATE TABLE demo_nvarchar2 (
   attribute_name NVARCHAR2(100));

INSERT INTO demo_nvarchar2 
VALUES ('This is a test for nvarchar2');

The string is 28 characters long. We can use the DUMP function to see the details of what is actually stored.

SELECT attribute_name, DUMP(attribute_name,1016)
FROM demo_nvarchar2;

The DUMP function returns a VARCHAR2 value that contains the datatype code, the length in bytes, and the internal representation of a value.

You can see the difference in the storage size of the NVARCHAR2 and the VARCHAR2 attributes.

Valid values for the return_format are 8, 10, 16, 17, 1008, 1010, 1016 and 1017. These values are assigned the following meanings:

8 – octal notation
10 – decimal notation
16 – hexadecimal notation
17 – single characters
1008 – octal notation with the character set name
1010 – decimal notation with the character set name
1016 – hexadecimal notation with the character set name
1017 – single characters with the character set name

The returned value from the DUMP function gives the internal data type representation. The following table lists the various codes and their description.

Code	Data Type
1	VARCHAR2(size [BYTE \| CHAR])
1	NVARCHAR2(size)
2	NUMBER[(precision [, scale]])
8	LONG
12	DATE
21	BINARY_FLOAT
22	BINARY_DOUBLE
23	RAW(size)
24	LONG RAW
69	ROWID
96	CHAR [(size [BYTE \| CHAR])]
96	NCHAR[(size)]
112	CLOB
112	NCLOB
113	BLOB
114	BFILE
180	TIMESTAMP [(fractional_seconds)]
181	TIMESTAMP [(fractional_seconds)] WITH TIME ZONE
182	INTERVAL YEAR [(year_precision)] TO MONTH
183	INTERVAL DAY [(day_precision)] TO SECOND[(fractional_seconds)]
208	UROWID [(size)]
231	TIMESTAMP [(fractional_seconds)] WITH LOCAL TIMEZONE

This entry was posted in Oracle, SQL and tagged Oracle, SQL.

Ora-lytics

By Brendan Tierney

Storing and processing Unicode characters in Oracle

Storing and processing Unicode characters in Oracle

Share this:

Related