Principal Component Analysis (PCA) in Oracle

Posted on Updated on

Principal Component Analysis (PCA), is a statistical process used for feature or dimensionality reduction in data science and machine learning projects. It summarizes the features of a large data set into a smaller set of features by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data’s variation as possible. There are lots of resources that goes into the mathematics behind this approach. I’m not going to go into that detail here and a quick internet search will get you what you need.

PCA can be used to discover important features from large data sets (large as in having a large number of features), while preserving as much information as possible.

Statistically, PCA finds lines, planes and hyper-planes in the K-dimensional space that approximate the data as well as possible in the least squares sense. A line or plane that is the least squares approximation of a set of data points makes the variance of the coordinates on the line or plane as large as possible.

Oracle has implemented PCA using Sigular Value Decomposition (SVD) on the covariance and correlations between variables, for feature extraction/reduction. PCA is closely related to SVD. PCA computes a set of orthonormal bases (principal components) that are ranked by their corresponding explained variance. The main difference between SVD and PCA is that the PCA projection is not scaled by the singular values. The extracted features are transformed features consisting of linear combinations of the original features.

When machine learning is performed on this reduced set of transformed features, it can completed with less resources and time, while still maintaining accuracy.

Algorithm Name in Oracle using

Mining Model Function = FEATURE_EXTRACTION

Algorithm = ALGO_SINGULAR_VALUE_DECOMP

(Hyper)-Parameters for algorithms

  • SVDS_U_MATRIX_OUTPUT : SVDS_U_MATRIX_ENABLE or SVDS_U_MATRIX_DISABLE
  • SVDS_SCORING_MODE : SVDS_SCORING_SVD or SVDS_SCORING_PCA
  • SVDS_SOLVER : possible values include SVDS_SOLVER_TSSVD, SVDS_SOLVER_TSEIGEN, SVDS_SOLVER_SSVD, SVDS_SOLVER_STEIGEN
  • SVDS_TOLERANCE : range of 0…1
  • SVDS_RANDOM_SEED : range of 0…4294967296 (!)
  • SVDS_OVER_SAMPLING : range of 1…5000
  • SVDS_POWER_ITERATIONS : Default value 2, with possible range of 0…20

Let’s work through an example using the MINING_DATA_BUILD_V data set that comes with Oracle Data Miner.

First step is to define the parameter settings for the algorithm. No data preparation is needed as the algorithm takes care of this. This means you can disable the Automatic Data Preparation (ADP).

-- create the parameter table
CREATE TABLE svd_settings (
setting_name VARCHAR2(30),
setting_value VARCHAR2(4000));

-- define the settings for SVD algorithm
BEGIN 
   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_singular_value_decomp);

   -- turn OFF ADP
   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_off); 

   -- set PCA scoring mode
   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.svds_scoring_mode, dbms_data_mining.svds_scoring_pca);

   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.prep_shift_2dnum, dbms_data_mining.prep_shift_mean); 

   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.prep_scale_2dnum, dbms_data_mining.prep_scale_stddev); 
END;
/

You are now ready to create the model.

BEGIN
   DBMS_DATA_MINING.CREATE_MODEL(
      model_name          => 'SVD_MODEL',
      mining_function     => dbms_data_mining.feature_extraction,
      data_table_name     => 'mining_data_build_v',
      case_id_column_name => 'CUST_ID',
      settings_table_name => 'svd_settings');
END;

When created you can use the mining model data dictionary views to explore the model and to explore the specifics of the model and the various MxN matrix created using the model specific views. These include:

  • DM$VESVD_Model : Singular Value Decomposition S Matrix
  • DM$VGSVD_Model : Global Name-Value Pairs
  • DM$VNSVD_Model : Normalization and Missing Value Handling
  • DM$VSSVD_Model : Computed Settings
  • DM$VUSVD_Model : Singular Value Decomposition U Matrix
  • DM$VVSVD_Model : Singular Value Decomposition V Matrix
  • DM$VWSVD_Model : Model Build Alerts

Where the S, V and U matrix contain:

  • U matrix : consists of a set of ‘left’ orthonormal bases
  • S matrix : is a diagonal matrix
  • V matrix : consists of set of ‘right’ orthonormal bases

These can be explored using the following

-- S matrix
select feature_id, VALUE, variance, pct_cum_variance 
from DM$VESVD_MODEL;

-- V matrix
select feature_id, attribute_name, value
from DM$VVSVD_MODEL
order by feature_id, attribute_name;

-- U matrix
select feature_id, attribute_name, value
from DM$VVSVD_MODEL
order by feature_id, attribute_name;

To determine the projections to be used for visualizations we can use the FEATURE_VALUES function.

select FEATURE_VALUE(svd_sh_sample, 1 USING *) proj1, 
       FEATURE_VALUE(svd_sh_sample, 2 USING *) proj2
from   mining_data_build_v 
where  cust_id <= 101510
order by 1, 2;

 

Other algorithms available in Oracle for feature extraction and reduction include:

  • Non-Negative Matrix Factorization (NMF)
  • Explicit Semantic Analysis (ESA)
  • Minimum Description Length (MDL) – this is really feature selection rather than feature extraction