Function
Approximate Count Distinct (12.1.0.2 new feature)
With the release of the Oracle Database 12.1.0.2 there was a number of new features and options. Most of the publicity has been around the in-Memory option. But there was lots of other features for the DBA and a few for the developer.
One of the new SQL functions is the APPROX_COUNT_DISTINCT(). This function is different to the tradition count distinct, COUNT(DISTINCT expression), in that is performs an approximate count distinct. The theory is that this approximate count is a lot more efficient than performing the full count distinct.
The APPROX_COUNT_DISTINCT() function is really only suitable when you are processing very large volumes of data and when the data set contains a large number of distinct values.
The general syntax of the function is:
… APPROX_COUNT_DISTINCT(expression) …
and returns a Number.
The function returns the approximate number of records that contain distinct value for the expression.
SELECT approx_count_distinct(cust_id)
FROM mining_data_build_v;
The APPROX_COUNT_DISTINCT() function ignores records that contain a null value for the expression. Plus is performs less work on the sorting and aggregations. Just run and Explain Plan and you can see the differences.
In some of the material from Oracle the APPROX_COUNT_DISTINCT() function can be 5x to 50x++ times faster. But it depends on the number of distinct values and the complexity of the SQL query.
As the result / returned value from the function may not be 100% accurate, Oracle says that the functions has an accuracy of >97% (with 95% confidence).
The function cannot be used on the following data types: BFILE, BLOB, CLOB, LONG, LONG RAW and NCLOB
Tokenizing a String : Using Regular Expressions
In my previous blog post I gave some PL/SQL that performed the tokenising of a string. Check out this blog post here.
Thanks also to the people who sent me links examples of how to tokenise a string using the MODEL clause. Yes there are lots of examples of this out there on the interest.
While performing the various searches on the internet I did come across some examples of using Regular Expressions to extract the tokens. The following example is thanks to a blog post by Tanel Poder
I’ve made some minor changes to it to remove any of the special characters we want to remove.
column token format a40
define separator=” “
define mystring=”$My OTN LA Tour (2014?) will consist of Panama, CostRica and Mexico.”
define myremove=”\?|\#|\$|\.|\,|\;|\:|\&|\(|\)|\-“;
SELECT regexp_replace(REGEXP_REPLACE(
REGEXP_SUBSTR( ‘&mystring’||’&separator’, ‘(.*?)&separator’, 1, LEVEL )
, ‘&separator$’, ”), ‘&myremove’, ”) TOKEN
FROM
DUAL
CONNECT BY
REGEXP_INSTR( ‘&mystring’||’&separator’, ‘(.*?)&separator’, 1, LEVEL ) > 0
ORDER BY
LEVEL ASC
/
When we run this code we get the following output.
So we have a number of options open to use to tokenise strings using SQL and PL/SQL, using a number of approaches including substring-ing, using pipelined functions, using the Model clause and also using Regular Expressions.
Tokenizing a String
Over the past while I’ve been working a lot with text strings. Some of these have been short in length like tweets from Twitter, or longer pieces of text like product reviews. Plus others of various lengths.
In all these scenarios I have to break up the data into individual works or Tokens.
The examples given below illustrate how you can take a string and break it into its individual tokens. In addition to tokenising the string I’ve also included some code to remove any special characters that might be included with the string.
These include ? # $ . ; : &
This list of special characters to ignore are just an example and is not an exhaustive list. You can add whatever characters to the list yourself. To remove these special characters I’ve used regular expressions as this seemed to be the easiest way to do this.
Using PL/SQL
The following example shows a simple PL/SQL unit that will tokenise a string.
DECLARE
vDelimiter VARCHAR2(5) := ‘ ‘;
vString VARCHAR2(32767) := ‘Hello Brendan How are you today?’||vDelimiter;
vPosition PLS_INTEGER;
vToken VARCHAR2(32767);
vRemove VARCHAR2(100) := ‘\?|\#|\$|\.|\,|\;|\:|\&’;
vReplace VARCHAR2(100) := ”;
BEGIN
dbms_output.put_line(‘String = ‘||vString);
dbms_output.put_line(”);
dbms_output.put_line(‘Tokens’);
dbms_output.put_line(‘————————‘);
vPosition := INSTR(vString, vDelimiter);
WHILE vPosition > 0 LOOP
vToken := LTRIM(RTRIM(SUBSTR(vString, 1, vPosition-1)));
vToken := regexp_replace(vToken, vRemove, vReplace);
vString := SUBSTR(vString, vPosition + LENGTH(vDelimiter));
dbms_output.put_line(vPosition||’: ‘||vToken);
vPosition := INSTR(vString, vDelimiter);
END LOOP;
END;
/
When we run this (with Serveroutput On) we get the following output.
A slight adjustment is needed to the output of this code to remove the numbers or positions of the token separator/delimiter.
Tokenizer using a Function
To make this more usable we will really need to convert this into an iterative function. The following code illustrates this, how to call the function and what the output looks like.
CREATE OR replace TYPE token_list
AS TABLE OF VARCHAR2(32767);
/
CREATE OR replace FUNCTION TOKENIZER(pString IN VARCHAR2,
pDelimiter IN VARCHAR2)
RETURN token_list pipelined
AS
vPosition INTEGER;
vPrevPosition INTEGER := 1;
vRemove VARCHAR2(100) := ‘\?|\#|\$|\.|\,|\;|\:|\&’;
vReplace VARCHAR2(100) := ”;
vString VARCHAR2(32767) := regexp_replace(pString, vRemove, vReplace);
BEGIN
LOOP
vPosition := INSTR (vString, pDelimiter, vPrevPosition);
IF vPosition = 0 THEN
pipe ROW (SUBSTR(vString, vPrevPosition ));
EXIT;
ELSE
pipe ROW (SUBSTR(vString, vPrevPosition, vPosition – vPrevPosition ));
vPrevPosition := vPosition + 1;
END IF;
END LOOP;
END TOKENIZER;
/
Here are a couple of examples to show how it works and returns the Tokens.
SELECT column_value TOKEN
FROM TABLE(tokenizer(‘It is a hot and sunny day in Ireland.’, ‘ ‘))
, dual;
How if we add in some of the special characters we should see a cleaned up set of tokens.
SELECT column_value TOKEN
FROM TABLE(tokenizer(‘$$$It is a hot and sunny day in #Ireland.’, ‘ ‘))
, dual;
Adding numbers between two values
My son asked me the other day, what is the total number if you add all the numbers between Zero and 100.
We could have sat down to work it out with some paper and a pen, but instead I decided to introduce him to the world of Oracle, SQL and PL/SQL
The first step we took was to work out how you would do it on paper for some of the numbers. Then we translated this into some PL/SQL code. OK I did a lot this but he did seem to understand and follow what I was doing.
So the following Function is what we ended up with to add all the numbers between two numbers and return the answer.
CREATE or REPLACE function AddNumsBetween
( pStartNum IN NUMBER,
pEndNum IN NUMBER)
RETURN NUMBER
IS
vSum Number := 0;
BEGIN
FOR i IN pStartNum .. pEndNum LOOP
vSum := vSum + i;
END LOOP;
return vSum;
END;
/
The next step was to write some code to call this function. The code prompts the user to enter the Start number and End number.
set serveroutput on
DECLARE
vStartNum NUMBER := 0;
vEndNum NUMBER := 100;
vAnswer NUMBER := 0;
BEGIN
vStartNum := &Start_Number;
vEndNum := &End_Number;
vAnswer := AddNumsBetween(vStartNum, vEndNum);
dbms_output.put_line(‘The sum of numbers between ‘||vStartNum||’ and ‘||vEndNum||’ is ‘||vAnswer||’.’);
END;
/
To answer by son’s original query, we used Zero and 100 as our inputs.
The answer to the question is 5,050.
ASCII to character conversion in Oracle
Here is code code that will produce formatted output of the characters and their ascii values. The formatting is broken into lower case letters, uppercase letters, characters with an ascii value less than an ‘a’ and characters whose ascii characters are greater than a ‘z.
Code
set serveroutput on FORMAT WRAPPED
DECLARE
vTab VARCHAR2(5) := CHR(9);
vNum NUMBER := 0;
vString VARCHAR2(80) := ”;
BEGIN
—
— Formatted lower case letter to ASCII values
—
dbms_output.put_line(‘Formatted Lower Case Letters to ASCII values’);
dbms_output.put_line(‘——————————————————-‘);
FOR i IN ASCII(‘a’) .. ASCII(‘z’) LOOP
IF vNum < 6 THEN
vString := vString||CHR(i)||’ : ‘||i||vTab;
vNum := vNum + 1;
ELSIF vNum = 6 then
dbms_output.put_line(vString||CHR(i)||’ : ‘||i);
vNum := 0;
vString := ”;
ELSE
dbms_output.put_line(‘ERROR’);
END IF;
END LOOP;
dbms_output.put_line(vString);
—
— Formatted upper case letter to ASCII values
—
vString := ”;
vNum := 0;
dbms_output.new_line;
dbms_output.new_line;
dbms_output.put_line(‘Formatted Upper Case Letters to ASCII values’);
dbms_output.put_line(‘——————————————————-‘);
FOR i IN ASCII(‘A’) .. ASCII(‘Z’) LOOP
IF vNum < 6 THEN
vString := vString||CHR(i)||’ : ‘||i||vTab;
vNum := vNum + 1;
ELSIF vNum = 6 then
dbms_output.put_line(vString||CHR(i)||’ : ‘||i);
vNum := 0;
vString := ”;
ELSE
dbms_output.put_line(‘ERROR’);
END IF;
END LOOP;
dbms_output.put_line(vString);
—
— Formatted chars less than ‘a’ to ASCII values
—
vString := ”;
vNum := 0;
dbms_output.new_line;
dbms_output.new_line;
dbms_output.put_line(‘Formatted Letters, less than a to ASCII values’);
dbms_output.put_line(‘——————————————————-‘);
FOR i in 0 .. ASCII(‘a’)-1 LOOP
IF vNum < 6 THEN
vString := vString||CHR(i)||’ : ‘||i||vTab;
vNum := vNum + 1;
ELSIF vNum = 6 then
dbms_output.put_line(vString||CHR(i)||’ : ‘||i);
vNum := 0;
vString := ”;
ELSE
dbms_output.put_line(‘ERROR’);
END IF;
END LOOP;
dbms_output.put_line(vString);
—
— Formatted chars greater than ‘Z’ to ASCII values
—
vNum := 0;
vString := ”;
dbms_output.new_line;
dbms_output.new_line;
dbms_output.put_line(‘Formatted Letters, greater than z to ASCII values’);
dbms_output.put_line(‘——————————————————-‘);
FOR i IN ASCII(‘z’) .. ASCII(‘z’)+133 LOOP
IF vNum < 6 THEN
vString := vString||CHR(i)||’ : ‘||i||vTab;
vNum := vNum + 1;
ELSIF vNum = 6 then
dbms_output.put_line(vString||CHR(i)||’ : ‘||i);
vNum := 0;
vString := ”;
ELSE
dbms_output.put_line(‘ERROR’);
END IF;
END LOOP;
dbms_output.put_line(vString);
END;
/
Output
Formatted Lower Case Letters to ASCII values
——————————————————-
a : 97 b : 98 c : 99 d : 100 e : 101 f : 102 g : 103
h : 104 i : 105 j : 106 k : 107 l : 108 m : 109 n : 110
o : 111 p : 112 q : 113 r : 114 s : 115 t : 116 u : 117
v : 118 w : 119 x : 120 y : 121 z : 122
Formatted Upper Case Letters to ASCII values
——————————————————-
A : 65 B : 66 C : 67 D : 68 E : 69 F : 70 G : 71
H : 72 I : 73 J : 74 K : 75 L : 76 M : 77 N : 78
O : 79 P : 80 Q : 81 R : 82 S : 83 T : 84 U : 85
V : 86 W : 87 X : 88 Y : 89 Z : 90
Formatted Letters, less than a to ASCII values
——————————————————-
: 0 ☺ : 1 ☻ : 2 ♥ : 3 ♦ : 4 ♣ : 5 ♠ : 6
: 7 : 8 : 9
: 13 ♂ : 11 ♀ : 12
♫ : 14 ☼ : 15 ► : 16 ◄ : 17 ↕ : 18 ‼ : 19 ¶ : 20
§ : 21 ▬ : 22 ↨ : 23 ↑ : 24 ↓ : 25 → : 26 ← : 27
∟ : 28 ↔ : 29 ▲ : 30 ▼ : 31 : 32 ! : 33 ” : 34
# : 35 $ : 36 % : 37 & : 38 ‘ : 39 ( : 40 ) : 41
* : 42 + : 43 , : 44 – : 45 . : 46 / : 47 0 : 48
1 : 49 2 : 50 3 : 51 4 : 52 5 : 53 6 : 54 7 : 55
8 : 56 9 : 57 : : 58 ; : 59 : 62
? : 63 @ : 64 A : 65 B : 66 C : 67 D : 68 E : 69
F : 70 G : 71 H : 72 I : 73 J : 74 K : 75 L : 76
M : 77 N : 78 O : 79 P : 80 Q : 81 R : 82 S : 83
T : 84 U : 85 V : 86 W : 87 X : 88 Y : 89 Z : 90
[ : 91 \ : 92 ] : 93 ^ : 94 _ : 95 ` : 96
Formatted Letters, greater than z to ASCII values
——————————————————-
z : 122 { : 123 | : 124 } : 125 ~ : 126 ⌂ : 127 Ç : 128
ü : 129 é : 130 â : 131 ä : 132 à : 133 å : 134 ç : 135
ê : 136 ë : 137 è : 138 ï : 139 î : 140 ì : 141 Ä : 142
Å : 143 É : 144 æ : 145 Æ : 146 ô : 147 ö : 148 ò : 149
û : 150 ù : 151 ÿ : 152 Ö : 153 Ü : 154 ø : 155 £ : 156
Ø : 157 × : 158 ƒ : 159 á : 160 í : 161 ó : 162 ú : 163
ñ : 164 Ñ : 165 ª : 166 º : 167 ¿ : 168 ® : 169 ¬ : 170
½ : 171 ¼ : 172 ¡ : 173 « : 174 » : 175 ░ : 176 ▒ : 177
▓ : 178 │ : 179 ┤ : 180 Á : 181 Â : 182 À : 183 © : 184
╣ : 185 ║ : 186 ╗ : 187 ╝ : 188 ¢ : 189 ¥ : 190 ┐ : 191
└ : 192 ┴ : 193 ┬ : 194 ├ : 195 ─ : 196 ┼ : 197 ã : 198
à : 199 ╚ : 200 ╔ : 201 ╩ : 202 ╦ : 203 ╠ : 204 ═ : 205
╬ : 206 ¤ : 207 ð : 208 Ð : 209 Ê : 210 Ë : 211 È : 212
ı : 213 Í : 214 Î : 215 Ï : 216 ┘ : 217 ┌ : 218 █ : 219
▄ : 220 ¦ : 221 Ì : 222 ▀ : 223 Ó : 224 ß : 225 Ô : 226
Ò : 227 õ : 228 Õ : 229 µ : 230 þ : 231 Þ : 232 Ú : 233
Û : 234 Ù : 235 ý : 236 Ý : 237 ¯ : 238 ´ : 239 : 240
± : 241 ‗ : 242 ¾ : 243 ¶ : 244 § : 245 ÷ : 246 ¸ : 247
° : 248 ¨ : 249 · : 250 ¹ : 251 ³ : 252 ² : 253 ■ : 254
: 255
PL/SQL procedure successfully completed.
Observations
There are two things that stand out in this. The first is there is sound produced. This is because one of the characters is defined this way. It is ASCII number 7. This can be repeated using the following:
select chr(7) from dual.
The second is the formatting of the lines for ascii codes 8 to 12. We can see that one of the ascii codes does not get displayed and the ordering of this is not as expected. This is due to ascii 10 being a line feed.