Transforming Outliers in Oracle Data Mining

Posted on Updated on

In previous posts I’ve shown how to use the DBMS_DATA_MINING.TRANSFORM function to transform data is various ways including, normalization and missing data. In this post I’ll build upon these to show how to outliers can be handled.

The following example will show you how you can transform data to identify outliers and transform them. In the example, Winsorsizing transformation is performed where the outlier values are replaced by the nearest value that is not an outlier.

The transformation process takes place in three stages. For the first stage a table is created to contain the outlier transformation data. The second stage calculates the outlier transformation data and store these in the table created in stage 1. One of the parameters to the outlier procedure requires you to list the attributes you do not the transformation procedure applied to (this is instead of listing the attributes you do want it applied to).  The third stage is to create a view (MINING_DATA_V_2) that contains the data set with the outlier transformation rules applied. The input data set to this stage can be the output from a previous transformation process (e.g. DATA_MINING_V).

   -- Clean-up : Drop the previously created tables
      execute immediate 'drop table TRANSFORM_OUTLIER';
      WHEN others THEN

   -- Stage 1 : Create the table for the transformations
   -- Perform outlier treatment for: AGE and YRS_RESIDENCE
      clip_table_name => 'TRANSFORM_OUTLIER');

   -- Stage 2 : Transform the categorical attributes
   --   Exclude the number attributes you do not want transformed
      clip_table_name => 'TRANSFORM_OUTLIER',
      data_table_name => 'MINING_DATA_V',
      tail_frac       => 0.025,

   -- Stage 3 : Create the view with the transformed data
      clip_table_name => 'TRANSFORM_OUTLIER',
      data_table_name => 'MINING_DATA_V',
      xform_view_name => 'MINING_DATA_V_2');

The view MINING_DATA_V_2 will now contain the data from the original data set transformed to process missing data for numeric and categorical data (from previous blog post), and also has outlier treatment for the AGE attribute.