Oracle Advanced Analytics
In my previous blog post I introduced the new Explicit Semantic Analysis (ESA) algorithm and gave an example of how you can build an ESA model and use it. Check out this link for that blog post.
In this blog post I will show you how you can manually create an ESA model. The reason that I’m showing you this way is that the workflow (in ODMr and it’s scheduler) may not be for everyone. You may want to automate the creation or recreation of the ESA model from time to time based on certain business requirements.
In my previous blog post I showed how you can setup a training data set. This comes with ODMr 4.2 but you may need to expand this data set or to use an alternative data set that is more in keeping with your domain.
Setup the ODM Settings table
As with all ODM algorithms we need to create a settings table. This settings table allows us to store the various parameters and their values, that will be used by the algorithm.
-- Create the settings table CREATE TABLE ESA_settings ( setting_name VARCHAR2(30), setting_value VARCHAR2(30)); -- Populate the settings table -- Specify ESA. By default, Naive Bayes is used for classification. -- Specify ADP. By default, ADP is not used. Need to turn this on. BEGIN INSERT INTO ESA_settings (setting_name, setting_value) VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_explicit_semantic_analys); INSERT INTO ESA_settings (setting_name, setting_value) VALUES (dbms_data_mining.prep_auto,dbms_data_mining.prep_auto_on); INSERT INTO ESA_settings (setting_name, setting_value) VALUES (odms_sampling,odms_sampling_disable); commit; END;
These are the minimum number of parameter setting needed to run the ESA algorithm. The other ESA algorithm setting include:
Setup the Oracle Text Policy
You also need to setup an Oracle Text Policy and a lexer for the Stopwords.
DECLARE v_policy_name varchar2(30); v_lexer_name varchar2(3) BEGIN v_policy_name := 'ESA_TEXT_POLICY'; v_lexer_name := 'ESA_LEXER'; ctx_ddl.create_preference(v_lexer_name, 'BASIC_LEXER'); v_stoplist_name := 'CTXSYS.DEFAULT_STOPLIST'; -- default stop list ctx_ddl.create_policy(policy_name => v_policy_name, lexer => v_lexer_name, stoplist => v_stoplist_name); END;
Create the ESA model
Once we have the settings table created with the parameter values set for the algorithm and the Oracle Text policy created, we can now create the model.
To ensure that the Oracle Text Policy is applied to the text we want to analyse we need to create a transformation list and add the Text Policy to it.
We can then pass the text transformation list as a parameter to the CREATE_MODEL, procedure.
DECLARE v_xlst dbms_data_mining_transform.TRANSFORM_LIST; v_policy_name VARCHAR2(130) := 'ESA_TEXT_POLICY'; v_model_name varchar2(50) := 'ESA_MODEL_DEMO_2'; BEGIN v_xlst := dbms_data_mining_transform.TRANSFORM_LIST(); DBMS_DATA_MINING_TRANSFORM.SET_TRANSFORM(v_xlst, '"TEXT"', NULL, '"TEXT"', '"TEXT"', 'TEXT(POLICY_NAME:'||v_policy_name||')(MAX_FEATURES:3000)(MIN_DOCUMENTS:1)(TOKEN_TYPE:NORMAL)'); DBMS_DATA_MINING.DROP_MODEL(v_model_name, TRUE); DBMS_DATA_MINING.CREATE_MODEL( model_name => v_model_name, mining_function => DBMS_DATA_MINING.FEATURE_EXTRACTION, data_table_name => 'WIKISAMPLE', case_id_column_name => 'TITLE', target_column_name => NULL, settings_table_name => 'ESA_SETTINGS', xform_list => v_xlst); END;
NOTE: Yes we could have merged all of the above code into one PL/SQL block.
Use the ESA model
We can now use the FEATURE_COMPARE function to use the model we just created, just like I did in my previous blog post.
SELECT FEATURE_COMPARE(ESA_MODEL_DEMO_2 USING 'Oracle Database is the best available for managing your data' text AND USING 'The SQL language is the one language that all databases have in common' text) similarity FROM DUAL;
Go give the ESA algorithm a go and see where you could apply it within your applications.
Oracle Data Miner 4.2 (part of SQL Developer 4.2) got released as an Early Adopter versions (EA) a few weeks ago.
With the new/updated Oracle Data Miner (ODMr) there are a number of new features. These can be categories as 1) features all ODMr users can use now, 2) New features that are only usable when using Oracle 12.2 Database, and 3) Updates to existing algorithms that have been exposed via the ODMr tool.
The following is a round up of the main new features you can enjoy as part of ODMr 4.2 (mainly covering points 1 and 2 above)
- You can now schedule workflows to run based on a defined schedule
- Support for additional data types (RAW, ROWID, UROWID, URITYPE)
- Better support for processing JSON data in the JSON Query node
- Additional insights are displayed as part of the Model Details View
- Additional alert monitoring and reporting
- Better support for processing in-memory data
- A new R Model node that allows you to include in-database ORE user defined R function to support model build, model testing and applying of new model.
- New Explicit Semantic Analysis node (Explicit Feature Extraction)
- New Feature Compare and Test nodes
- New workflow status profiling perfoance improvements
- Refresh the input data definition in nodes
- Attribute Filter node now allows for unsupervised attribute importance ranking
- The ability to build Partitioned data mining models
Look out for the blog posts on most of these new features over the coming months.
WARNING: Most of these new features requires an Oracle 12.2 Database.
In a previous blog post I talked about how you can rename and comment your Oracle Data Mining models. This is to allow you to easily to see and understand the intended use of the data mining model.
Another feature available to you is to audit the usage of the the data mining models. As your data mining environment grows to many 10s or more typically 100s of models, you will need to have some way of tracking their usage. This can allow you to discover what models are frequently being used and those that are not being used in-frequently. You can then use this information to investigate if there are any issues. Or in some companies I’ve seen an internal charging scheme in place for each time the models are used.
The following outlines the steps required to setup the auditing of your models and how to inspect the usage.
Note: You will need to the AUDIT_ADMIN role to audit the models.
First create an audit policy for the data mining model in a particular schema.
CREATE AUDIT POLICY oaa_odm_audit_usage ACTIONS ALL ON MINING MODEL dmuser.high_value_churn_clas_svm;
This creates a policy that monitors all activity on the data mining model HIGH_VALUE_CHURN_CLAS_SVM in the DMUSER schema.
Now we need to enable the policy and allow to to tract all activity on the model.
AUDIT POLICY oaa_odm_audit_usage BY oaa_model_user;
This will track all usage of the data mining model by the schema call OAA_MODEL_USER. We can then use the following query to search for the audit records for the OAA_MODEL_USER schema.
SELECT dbusername, action_name, systemm_privilege_used, return_code, object_schema, object_name, sql_text FROM unified_audit_trail WHERE object_name = 'HIGH_VALUE_CHURN_CLAS_SVM';
But there is a little problem with using what I’ve just shown you above. The problem is that it will track all activity on the data mining model. Perhaps this isn’t what we really want. Perhaps we only want to track only certain activity of the data mining model. Instead of creating the policy using ‘ACTIONS ALL’, we can list out the actions or operations we want to track. For example, we want to tract when it is used in a SELECT. The following shows how you can set this up for just SELECT.
CREATE AUDIT POLICY oaa_odm_audit_select ACTIONS SELECT ON MINING MODEL dmuser.high_value_churn_clas_svm; AUDIT POLICY oaa_odm_audit_select BY oaa_model_user;
The list of individual audit events you can use include:
A policy can be setup to tract one or more of these events. For example, if we wanted a policy to track SELECT and GRANT, we would have list each event separated by a comma.
CREATE AUDIT POLICY oaa_odm_audit_select_grant ACTIONS SELECT ON MINING MODEL dmuser.high_value_churn_clas_svm, ACTIONS GRANT ON MINING MODEL dmuser.high_value_churn_clas_svm, ; AUDIT POLICY oaa_odm_audit_select_grant BY oaa_model_user;
As your company evolves with their data mining projects, the number of models produced and in use in production will increase dramatically.
Care needs to be taken when it comes to managing these. This includes using meaningful names, adding descriptions of what the model is about or for, and being able to track their usage, etc.
I will look at tracking the usage of the models in another blog post, but the following gives examples of how to rename Oracle Data Mining models and how to add comments or descriptions to these models. This is particularly useful because our data analytics teams have a constant turn over or it has been many months since you last worked on a model and you want a quick idea of what purpose of the model was for.
If you have been using the Oracle Data Mining tool (part of SQL Developer) will will see your model being created with some sort of sequencing numbers. For example for a Support Vector Machine (SVM) model you might see it labelled for classification:
While you are working on this project you will know and understand what it was about and why it is being used. But afterward you may forget as you will be dealing with many hundreds of models. Yes you could check your documentation for the purpose of this model but that can take some time.
What if you could run a SQL query to find out?
But first we need to rename the model.
Next we will want to add a longer description of what the model is about. We can do this by adding a comment to the model.
COMMENT ON MINING MODEL high_value_churn_clas_svm IS 'Classification Model to Predict High Value Customers most likely to Churn';
We can now see these updated details when we query the Oracle Data Mining models in a user schema.
SELECT model_name, mining_function, algorithm, comments FROM user_mining_models;
These are two very useful commands.
If you are a user of the Oracle Data Miner tool (the workflow data mining tool that is part of SQL Developer), then you will have noticed that for many of the algorithms you can specify a Case Id attribute along with, say, the target attribute.
The idea is that you have one attribute that is a unique identifier for each case record. This may or may not be the case in your data model and you may have a multiple attribute primary key or case record identifier.
But what is the Case Id field used for in Oracle Data Miner?
Based on the documentation this field does not need to have a value. But it is recommended that you do identify an attribute for the Case Id, as this will allow for reproducible results. What this means is that if we run our workflow today and again in a few days time, on the exact same data, we should get the same results. So the Case Id allows this to happen. But how? Well it looks like the attribute used or specified for the Case Id is used as part of the Hashing algorithm to partition the data into a train and test data set, for classification problems.
So if you don’t have a single attribute case identifier in your data set, then you need to create one. There are a few options open to you to do this.
- Create one: write some code that will generate a unique identifier for each of your case records based on some defined rule.
- Use a sequence: and update the records to use this sequence.
- Use ROWID: use the unique row identifier value. You can write some code to populate this value into an attribute. Or create a view on the table containing the case records and add a new attribute that will use the ROWID. But if you move the data, then the next time you use the view then you will be getting different ROWIDs and that in turn will mean we may have different case records going into our test and training data sets. So our workflows will generate different results. Not what we want.
- Use ROWNUM: This is kind of like using the ROWID. Again we can have a view that will select ROWNUM for each record. Again we may have the same issues but if we have our data ordered in a way that ensures we get the records returned in the same order then this approach is OK to use.
- Use Identity Column: In Oracle 12c we have a new feature called Identify Column. This kind of acts like a sequence but we can defined an attribute in a table to be an Identity Column, and as records are inserted into the the data (in our scenario our case table) then this column will automatically generate a unique number for our data. Again if we need to repopulate the case table, you will need to drop and recreate the table to get the Identity Column to reset, otherwise the newly inserted records will start with the next number of the Identity Column
Here is an example of using the Identity Column in a case table.
CREATE TABLE case_table ( id_column NUMBER GENERATED ALWAYS AS IDENTITY, affinity_card NUMBER, age NUMBER, cust_gender VARCHAR2(5), country_name VARCHAR2(20) ... );
You can now use this Identity Column as the Case Id in your Oracle Data Miner workflows.
The Oracle 12.2c Database has been released and is currently available as a Cloud Service. The on-site version should be with us soon.
A few weeks ago I listed some of the new features that you will find in the Oracle Data Miner GUI tool (check out that blog post). I’ll have another blog post soon that looks a bit closer at how the new OAA features are exposed in this tool.
In this blog post I will list most of the new database related features in Oracle 12.2. There is a lot of new features and a lot of updated features. Over the next few months (yes it will take that long) I’ll have blog posts on most of these.
The Oracle Advanced Analytics Option new features include:
- The first new feature is one that you cannot see. Yes that sound a bit odd. But the underlying architecture of OAA has been rebuilt to allow for the algorithms to scale significantly. This is also future proofing OAA for new features coming in future releases of the database.
- Explicit Semantic Analysis. This is a new algorithm allows us to perform text similarity comparison. This is a great new addition and and much, much easier now compared to what we may have had to do previously.
- Using R models using SQL. Although we have been able to do this in the previous version of the database, the framework and supports have been extended to allow for greater and easier usage of user defined R scripts and R models with the in-database environment.
- Partitioned Models. We can now build partitioned mining models. This is where you can specify an attribute and a separate model will be created based on each value in the attribute.
- Partitioned scoring. Similarly we can now dynamically score the data based on an partition attribute.
- Extentions to Association Rules. Over the past few releases of the database, additional insights to the workings and decision making of the algorithms have been included. In 12.2 we now have some additional insights for the Association Rules aglorithm where we can now get to see the calculation of values associated with rules.
- DBMS_DATA_MINING package extended. This PL/SQL package has been extended to include the functionality for the new features listed above. Additional it can now process R algorithms and models.
- SQL Function changes: Change to the followi ODM related SQL functions to allow for partitioned models. CLUSTER_DETAILS, CLUSTER_DISTANCE, CLUSTER_ID, CLUSTER_PROBABILITY, CLUSTER_SET, FEATURE_COMPARE, FEATURE_DETAILS, FEATURE_ID, FEATURE_SET, FEATURE_VALUE, ORA_DM_PARTITION_NAME, PREDICTION, PREDICTION_BOUNDS, PREDICTION_COST, PREDICTION_DETAILS, PREDICTION_PROBABILITY, PREDICTION_SET
- New SQL Hint for ODM models. We have had hints in SQL for many, many versions now, but with 12.2c we now have a hint for partitioned models, called GROUPING hint.
- New CREATE_MODEL function. With the existing CREATE_MODEL function the input data set for the function needed to be defined in a table or accessed using a view. Basically the data needed to resist somewhere. With CREAETE_MODEL2 you can now define the input data set based on a SELECT statement.
In addition to all of these changes there are also some new interesting DB, SQL and PL/SQL new features that are of particular interest for your data science, machine learning, advanced analytics (or whatever the current favourite marketing term is today) projects.
It is going to be a busy few months ahead, working through all of these new features and write blog posts on how to use each of them.
In this 5th blog post in my series on using the capabilities of Oracle Text, Oracle R Enterprise and Oracle Data Mining to process documents and text, I will have a look at some of the machine learning features of Oracle Text.
Oracle Text comes with a number of machine learning algorithms. These can be divided into two types. The first is called ‘Supervised Learning’ where we have two machine learning algorithms for classification type of problem. The second type is called ‘Unsupervised Learning’ where we have the ability to use clustering machine learning algorithms to look for patterns in our text documents and to find similarities between documents based on their contents.
It is this second type of document clustering that I will work through in this blog post.
When using clustering with text documents, the machine learning algorithm will look for patterns that are common between the documents. These patterns will include the words used, the frequency of the words, the position or ordering of these words, the co-occurance of words, etc. Yes this is a large an complex task and that is why we need a machine learning algorithm to help us.
With Oracle Text we only have one clustering machine learning algorithm available to use. When we move onto using the Oracle Advanced Analytics Option (Oracle Data Mining and Oracle R Enterprise) we more algorithms available to us.
With Oracle Text the clustering algorithm is called k-Means. In a way the actual algorithm is unimportant as it is the only one available to us when using Oracle Text. To use this algorithm we have the CTX_CLS.CLUSTERING procedure. This procedure takes the documents we want to compare and will then identify the clusters (using hierarchical clustering) and will then tells us, for each document, what clusters the documents belong to and they probability value. With clustering a document (or a record) can belong to many clusters. Typically in the text books we see clusters that are very distinct and are clearly separated from each other. When you work on real data this is never the case. We will have many over lapping clusters and a data point/record can belong to one or more clusters. This is why we need the probability vale. We can use this to determine what cluster our record belongs to most and what other clusters it is associated with.
Using the example documents that I have been using during this series of blog posts we can use the CTX_CLS.CLUSTERING algorithm to cluster and identify similarities in these documents.
We need to setup the parameters that will be used by the CTX_CLS.CLUSTERING procedure. Tell it to use the k-Means algorithm and then the number of clusters to generate. As with all Oracle Text procedures or algorithms there are a number of settings you can configure or you can just accept the default values.
exec ctx_ddl.drop_preference('Cluster_My_Documents'); exec ctx_ddl.create_preference('Cluster_My_Documents','KMEAN_CLUSTERING'); exec ctx_ddl.set_attribute('Cluster_My_Documents','CLUSTER_NUM','3');
The code above is an example of the basics of what you need to setup for clustering. Other attribute or cluster parameter setting available to you include,
MAX_DOCTERMS, MAX_FEATURES, THEME_ON, TOKEN_ON, STEM_ON, MEMORY_SIZE and SECTION_WEIGHT.
Now we can run the CTX_CLS.CLUSTERING procedure on our documents. This procedure has the following parameters.
– The Oracle Text Index Name
– Document Id Column Name
– Document Assignment (cluster assignment) Table Name. This table will be created if it doesn’t already exist
– Cluster Description Table Name. This table will be created if it doesn’t already exist.
– Name of the Oracle Text Preference (list)
exec ctx_cls.clustering( 'MY_DOCUMENTS_OT_IDX', 'DOC_PK', 'OT_CLUSTER_RESULTS', 'DOC_CLUSTER_DETAILS', 'Cluster_My_Documents');
When the procedure has completed we can now examine the OT_CLUSTER_RESULTS and the DOC_CLUSTER_DETAILS tables. The first of these (OT_CLUSTER_RESULTS) allows us to see what documents have been clustered together. The following is what was produced for my documents.
SELECT d.doc_pk, d.doc_title, r.clusterid, r.score FROM my_documents d, ot_cluster_results r WHERE d.doc_pk = r.docid;
We can see that two of the documents have been grouped into the same cluster (ClusterId=2). If you have a look back at what these documents are about then you can see that yes these are very similar. For the other two documents we can see that they have been clustered into separate clusters (ClusterId=4 & 5). The clustering algorithms have said that they are different types of documents. Again when you examine these documents you will see that they are talking about different topics. So the clustering process worked !
You can also explore the various features of the clusters by looking that he DOC_CLUSTER_DETAILS table. Although the details in this table are not overly useful but it will give you some insight into what clusters the k-Means algorithm has produced.
Hopefully I’ve shown you how easy it is to setup and use the clustering feature of Oracle Text.
WARNING: Before using the Clustering or Classification with Oracle Text, you need to check with your local Oracle Sales representative about if there is licence implication. There seems to be some mentions the the algorithms used are those that come with Oracle Data Mining. Oracle Data Mining is a licence cost option for the database. So make sure you check before you go using these features.