Hivemall: Feature Scaling based on Min-Max values

Posted on Updated on

Once of the most common tasks when preparing data for data mining and machine learning is to take numerical data and scale it. Most enterprise and advanced tools and languages do this automatically for you, but with lower level languages you need to perform the task. There are a number of approaches to doing this. In this example we will use the Min-Max approach.

With the Min-Max feature scaling approach, we need to find the Minimum and Maximum values of each numerical feature. Then using a scaling function that will re-scale the data to a Zero to One range. The general formula for this is.

Screenshot 2019-03-18 09.58.49

Using the IRIS data set as the data set (and loaded in previous post), the first thing we need to find is the minimum and maximum values for each feature.

select min(features[0]), max(features[0]),
       min(features[1]), max(features[1]),
       min(features[2]), max(features[2]),
       min(features[3]), max(features[3])
from iris_raw;

we get the following results.

4.3  7.9  2.0  4.4  1.0  6.9  0.1  2.5

The format of the results can be a little confusing. What this list gives us is the results for each of the four features.

For feature[0], sepal_length, we have a minimum value of 4.3 and a maximum value of 7.9.

Similarly,

feature[1], sepal_width,  min=2.0, max=4.4

feature[2], petal_length,  min=1.0, max=6.9

feature[3], petal_width,  min=0.1, max=2.5

To use these minimum and maximum values, we need to declare some local session variables to store these.

set hivevar:feature0_min=4.3;
set hivevar:feature0_max=7.9;
set hivevar:feature1_min=2.0;
set hivevar:feature1_max=4.4;
set hivevar:feature2_min=1.0;
set hivevar:feature2_max=6.9;
set hivevar:feature3_min=0.1;
set hivevar:feature3_max=2.5;

After setting those variables we can now write a SQL SELECT and use the add_bias function to perform the calculations.

select rowid, label,
       add_bias(array(
          concat("1:", rescale(features[0],${f0_min},${f0_max})), 
          concat("2:", rescale(features[1],${f1_min},${f1_max})), 
          concat("3:", rescale(features[2],${f2_min},${f2_max})), 
          concat("4:", rescale(features[3],${f3_min},${f3_max})))) as features
from iris_raw;

and we get

> 1 Iris-setosa   ["1:0.22222215","2:0.625","3:0.0677966","4:0.041666664","0:1.0"]
> 2 Iris-setosa   ["1:0.16666664","2:0.41666666","3:0.0677966","4:0.041666664","0:1.0"]
> 3 Iris-setosa   ["1:0.11111101","2:0.5","3:0.05084745","4:0.041666664","0:1.0"]
...

Other feature scaling methods, available in Hivemall, include L1/L2 Normalization and zscore.