4.3. Learn

The ML Learn Plugin allows for a Machine Learning Model to be trained.

python bin/lucy.egg plugins ml_learn -h

Primary Arguments

Main specifications for a Learn task.

Argument Detail
-input PATH. File path to the input data.
-output PATH. File path to where the output summary file will be saved. It is recommended that this file be saved as a JSON.
-mfile PATH. File path to where the resulting pickled model file will be saved. This is a serialized pickle (.pkl) file.
-clabel STR. Used with ‘-classify’ and ‘-regression’. The name of the target/response/output column.

Preprocessing

Manipulation of the data before using it.

Argument Detail
-autoclean NONE. Cleans the data based on a pre-defined set of rules before the Learn task is applied.
-independents COMMA SEPARATED LIST. A subset of column names that will be used for Learning. The subset will automatically include the column given by ‘-clabel’ if one is specified.
-normalize METHOD. (‘standard’, ‘minmax’) Normalizes the columns of ALL numerical data using the selected scaling method.
-degree INT. (1 - 5) Generates new input columns as polynomial combinations of the original input dataset, up to the given degree. This transformation can only be applied to numerical input data.
-cv METHOD. (‘yes’, ‘no’) Turn on Cross-validation. Currently, option ‘yes’ enables K-fold cross-validation.
-cval INT. Used with ‘-cv’ when the K-fold Cross-Validation has been selected. The value given is the ‘K’ number of folds.
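The two ‘-normalize’ methods can be sketched in plain Python on a single numeric column (a minimal illustration of the scaling formulas, not the plugin's internals):

```python
# Minimal sketch of the two -normalize methods on a single numeric column.
from statistics import mean, pstdev

def minmax(values):
    """Scale values into [0, 1] based on the column's min and max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standard(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

petal_length = [1.4, 4.27, 6.0]
print(minmax(petal_length))    # values fall in [0, 1]
print(standard(petal_length))  # values centered around 0
```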

Cluster

Unsupervised learning methods used to train models that put observations into similar categories.

Argument Detail
-cluster ALGO. (‘kmeans’, ‘meanshift’) Unsupervised Learning algorithms that group similar data points.
-nc INT. Used with ‘-cluster kmeans’. The number of clusters that the observations will be put into based on found similarity. The default value is 3.
-bandwidth FLOAT. Used with ‘-cluster meanshift’. Reflects the spread of the observations. The bandwidth must be chosen with the range(s) of the dataset in mind. A dataset that has been normalized using a minmax scaler will have numerical values between 0 and 1, so the bandwidth should also be between 0 and 1.

Classify

Supervised learning methods used to train models based on categorical target values.

Argument Detail
-classify ALGO. (‘dtree’, ‘rfc’, ‘gnb’, ‘logistic’) Learns a classification model using the algorithm given.
-cval FLOAT. (0 - 100) Used with ‘-classify dtree’, ‘-classify rfc’, or ‘-classify gnb’. Percentage of the data to be used for cross-validation. The default value is 1.
-criterion CRITERION. (‘gini’, ‘entropy’) Used with ‘-classify dtree’ and ‘-classify rfc’. The method to measure the quality of the split. The default is ‘gini’.
-nestimators INT. Used with ‘-classify rfc’. The number of trees in the forest. The default value is 10.
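The two ‘-criterion’ options measure node impurity differently. A minimal sketch of the underlying formulas (an illustration only, not the plugin's implementation):

```python
# Sketch of the two -criterion impurity measures for a set of class labels.
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2). 0 means the node is pure."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy: -sum(p_i * log2(p_i)). Also 0 for a pure node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(gini([0, 0, 1, 1]))     # 0.5 for a 50/50 split
print(entropy([0, 0, 1, 1]))  # 1.0 bit for a 50/50 split
print(gini([0, 0, 0, 0]))     # 0.0 for a pure node
```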

Regression

Supervised learning methods used to train models based on numerical target values.

Argument Detail
-regression ALGO. (‘linear’, ‘lasso’, ‘ridge’, ‘svr’, ‘rfr’, ‘dtreereg’, ‘gpr’) Learns a regression model using the algorithm given.
-grid-response NONE. Used with ‘-regression’. Works for input feature space sizes of 1 or 2. With the response, this results in a list of points that are 2D or 3D. Input prediction points are arranged based on the range(s) of the input values. The idea is that ‘-grid-response’ generates points that can be used to construct the regression line or a surface. Currently provides 10^p points, where ‘p’ is the number of input parameters.
-optimize METHOD. (‘min’, ‘max’) Uses the model to determine the inputs required to obtain a prediction value that meets the optimize method selected. Used with ‘-regression linear/lasso/ridge’.
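The grid construction described for ‘-grid-response’ can be sketched as follows. This is a guess at the mechanism based on the description above, assuming 10 evenly spaced values per input dimension (so 10^p points total):

```python
# Sketch of building a prediction grid over the ranges of p input columns.
from itertools import product

def make_grid(column_ranges, points_per_dim=10):
    """column_ranges: list of (min, max) per input column.
    Returns points_per_dim ** p grid points spanning those ranges."""
    axes = []
    for lo, hi in column_ranges:
        step = (hi - lo) / (points_per_dim - 1)
        axes.append([lo + i * step for i in range(points_per_dim)])
    return list(product(*axes))

# Two input features -> 10^2 = 100 grid points for a 3D response surface.
grid = make_grid([(0.0, 1.0), (-5.0, 5.0)])
print(len(grid))  # 100
```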

Learn Examples

Input

A Learn routine begins by specifying the file path from which the input data will be read.

python bin/lucy.egg plugins ml_learn -input ./input_data.csv

Output

A Learn routine ends by specifying where the output data is to be saved. This file is in JSON format and is packed with information that is useful to the user; it contains the information necessary to reconstruct the learned model without requiring Python.

python bin/lucy.egg plugins ml_learn -input ./input_data.csv -output ./out.json

Preprocessing

Dimension Reduction

Dimensionality Reduction is the process of transforming data to a lower-dimensional space. Ideally, this lower-dimensional space retains many of the same properties as the original dataset.

Feature Importance

Lucy uses two metrics for measuring Feature Importance given only a Target column. Currently, these metrics can only be obtained when running a Lucy Learn routine.

  • Correlation - The correlation of each numeric input variable and a numeric Target
  • Gini Importance - The importance measure of each input variable for a given target.

To obtain these, the option -feature-importance is used with no argument required. These values are populated when specifying a -clabel and are saved in the Lucy-Response JSON when a path for the Lucy-Response is specified using -output.

Consider the iris dataset and the following command:

python bin/lucy.egg plugins ml_learn -input iris.csv -output iris_output.json -clabel class -classify dtree -feature-importance

The Lucy-Response is populated with the following scores:

{
    ...
    "payload": {
        ...
        "feature_importance": {
            "gini_importance": {
                "sepal_width": 0.006145984326963746,
                "petal_width": 0.4852350158451978,
                "sepal_length": 0.0059189255008685404,
                "petal_length": 0.50270007432697
            }
        }
    }
}

Dimensional Reduction Suggestions

-dim-reduct

Principal Component Analysis

PCA is a linear dimensionality reduction technique that uses Singular Value Decomposition to project the original data to a lower dimensional space. Lucy keeps the principal components that explain 99% of the variance of the original data. These components are then used to transform the original data to a space of a lower dimension. The resulting columns do not share any direct relation to the features of the original data.
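The 99% rule can be illustrated by selecting components from their explained-variance ratios. The variance values below are hypothetical; the plugin presumably derives the real ones from the SVD:

```python
# Sketch: keep the fewest principal components explaining >= 99% of variance.
def components_to_keep(explained_variance, threshold=0.99):
    """explained_variance: per-component variances, sorted largest first."""
    total = sum(explained_variance)
    running = 0.0
    for k, var in enumerate(explained_variance, start=1):
        running += var
        if running / total >= threshold:
            return k
    return len(explained_variance)

# Hypothetical variances for 4 components, sorted largest first.
variances = [4.2, 0.24, 0.078, 0.002]
print(components_to_keep(variances))  # 3
```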

python bin/lucy.egg plugins ml_learn -input ./input.csv -output ./out.json -regression linear -clabel target -pca

The resulting PCA dataset is saved under ‘pca_dataset’ in the output file and can be used in place of the original input. The PCA dataset would need to be saved as a CSV, and a new instance of Learn initiated with the pca_dataset as input, along with the copied-over target column.

Note

Currently, these dimensional reduction insight options can only be used as part of a successful Learn run. The dimension reduction results are not used by the Learn task itself; they are options for the user to consider.

Independents

-independents

Normalize

-normalize

Autoclean

Autoclean is a way to clean a dataset before performing the Learn task. Details of the exact procedures performed by Autoclean can be found at Clean’s Autoclean.

The argument -autoclean is used in order to clean the dataset before the Learn task is applied.

python bin/lucy.egg plugins ml_learn -input ./input_data.csv -output ./out.json -regression linear -clabel target -autoclean

Cross-Validation

The -cv option allows a user to specify what Cross-validation method will be used. More information on Cross-Validation can be found here.

Additionally, -cval is used to specify the number of folds for K-fold Cross-Validation.
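K-fold Cross-Validation splits the rows into K folds, trains on K-1 of them, and validates on the held-out fold, rotating K times. A minimal sketch of the fold assignment (the plugin presumably delegates to a library implementation):

```python
# Sketch of K-fold index splitting, as selected by -cv yes and -cval K.
def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Distribute any remainder so every sample lands in exactly one fold.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(kfold_indices(10, 5))
print(len(folds))    # 5 folds
print(folds[0][1])   # [0, 1] -- the first held-out fold
```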

Clustering

Clustering is an unsupervised learning technique that discovers natural groupings in a dataset.

K-Means

The K-Means clustering algorithm groups samples into a user-defined number of groups, p. The group a sample belongs to is based on which of the p centroids it is closest to. The locations of these centroids are found through iteration of the K-Means algorithm. The algorithm scales well with a large number of samples.
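The iteration K-Means performs can be sketched in one dimension: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. This is a minimal illustration, not the plugin's implementation:

```python
# Minimal 1-D K-Means sketch: assignment and update steps repeated to convergence.
def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins the nearest centroid's group.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

# Two clear 1-D clusters around 1 and 10; centroids converge near [1.0, 10.0].
print(kmeans_1d([0.9, 1.0, 1.1, 9.9, 10.0, 10.1], centroids=[0.0, 5.0]))
```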

Consider the Iris Dataset containing 3 different classes:

sepal_length  sepal_width  petal_length  petal_width  class
5.1           3.5          1.4           0.2          0
7             3.2          4.27          1.4          1
6.3           3.3          6             2.5          2

Plotting the petal length vs. the petal width shows a fairly clear separation between the three classes.

_images/iris_cluster.png

Now let’s use K-Means clustering to visually compare these results. It is recommended that the number of clusters be chosen to fit the problem at hand. This can be accomplished using -nc.

python bin/lucy.egg plugins ml_learn -input ./input_data.csv -output ./out.json -independents petal_length,petal_width -cluster kmeans -nc 3

The resulting group assignments are saved in the Lucy-Response JSON file at the -output path. The found labels are saved as a dictionary where the key is the id (or the order) in which the input points were provided to the Learn routine. The number of unique values these points are assigned to is equal to the -nc provided. The location where these can be found in the Lucy-Response is shown below:

{
    ...
    "payload": {
        ...
        "model_dict": {
            "labels_": {
                "139": 0,
                "138": 2,
                "24": 1,
                ...
            }
        }
    }
}

‘payload’ -> ‘model_dict’ -> ‘labels_’.

To visualize the results, the samples can be assigned the groups they have been given from the K-Means clustering in the same order the input data was given. The results can be seen below.

_images/iris_kmeans.png

Additionally, the location of the centroids can be found in the output JSON at ‘payload’ -> ‘model_dict’ -> ‘cluster_centers_’. For this example the centroids were found to be

cluster_centers_ = [
    [1.4639999999999995,
     0.24400000000000022],
    [4.269230769230769,
     1.3423076923076924],
    [5.595833333333332,
     2.0374999999999996]
]

Note

There are no labels associated with the result. This is an unsupervised learning algorithm. The samples are grouped but the groups are not assigned any labels since there are no labels to learn from.

Mean Shift

MeanShift clustering aims to discover blobs in a smooth density of samples. It is a centroid based algorithm that works by updating candidates for centroids to be the mean of the points within a given region called the bandwidth. These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids.
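The mean-update step Mean Shift performs can be sketched in one dimension with a flat kernel: shift each candidate centroid to the mean of the points within the bandwidth, repeat, then merge near-duplicate candidates. This is an illustration only; the plugin presumably uses a library implementation:

```python
# 1-D Mean Shift sketch with a flat kernel.
def mean_shift_1d(points, bandwidth, iterations=20):
    candidates = list(points)  # every point starts as a candidate centroid
    for _ in range(iterations):
        # Shift each candidate to the mean of the points within the bandwidth.
        new_candidates = []
        for c in candidates:
            nearby = [p for p in points if abs(p - c) <= bandwidth]
            new_candidates.append(sum(nearby) / len(nearby))
        candidates = new_candidates
    # Post-processing: collapse candidates that landed on (nearly) the same spot.
    centers = []
    for c in sorted(candidates):
        if not centers or c - centers[-1] > bandwidth / 2:
            centers.append(c)
    return centers

# Two groups of points; two centers emerge, near 1.2 and 8.1.
print(mean_shift_1d([1.0, 1.2, 1.4, 8.0, 8.2], bandwidth=1.0))
```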

Consider the Iris Dataset containing 3 different classes:

sepal_length  sepal_width  petal_length  petal_width  class
5.1           3.5          1.4           0.2          0
7             3.2          4.27          1.4          1
6.3           3.3          6             2.5          2

Plotting the petal length vs. the petal width shows a fairly clear separation between the three classes.

_images/iris_cluster.png

Now let’s use Mean Shift clustering to visually compare its results with the true category separation. A bandwidth should be specified. It is recommended that the data be normalized before performing the algorithm. This will narrow the required bandwidth to be between 0 and 1.

python bin/lucy.egg plugins ml_learn -input ./input_data.csv -output ./out.json -independents petal_length,petal_width -cluster meanshift -bandwidth 0.5 -normalize standard

The resulting group assignments can be found in the output JSON at the location ‘payload’ -> ‘model_dict’ -> ‘labels_’. To visualize the results, the samples can be assigned the groups they have been given from the Mean Shift clustering in the same order the input data was given. The results can be seen below.

_images/iris_meanshift.png

Additionally, the location of the centroids can be found in the output JSON at ‘payload’ -> ‘model_dict’ -> ‘cluster_centers_’. For this example the centroids are reflective of the normalized data and are found to be

cluster_centers_ = [
    [-1.3005214861029293,
     -1.2509378621062448],
    [0.37427641672771905,
     0.24821582279821572],
    [0.8873031279121748,
     1.0261947088710563]
]

Note

There ended up being 3 total clusters, but this won’t always be the case. The number of clusters depends on the bandwidth: as the bandwidth increases, the number of clusters decreases.

Classification

Supervised learning methods used to train models based on categorical target values.

WIP

Decision Tree Classifier

The trained model is a single Decision Tree whose target column is categorical.

Consider the Iris Dataset containing 3 different classes:

sepal_length  sepal_width  petal_length  petal_width  class
5.1           3.5          1.4           0.2          0
7             3.2          4.27          1.4          1
6.3           3.3          6             2.5          2

There are a few requirements when performing classification using a Decision Tree: we need -input data, a -clabel, and -classify dtree. Additionally, we will need to specify an -mfile if we want to save the resulting model to use later in Predict. If we want some readable information about the model, we can specify -output to save an informative Lucy-Response in JSON format.

Putting everything together we have the command:

python bin/lucy.egg plugins ml_learn -input iris.csv -output dtree_LucyResponse.json -mfile iris_dtree.pkl -classify dtree -clabel class

The structure of the Tree is saved at:

{
    ...
    "payload": {
        ...
        "model_dict": {
            ...
            "tree_data": {
                ...
            }
        }
    }
}
Let's assume that the data has undergone a train-test split. Perhaps this will be a feature in a future version, but for now the split needs to be done manually or with Sklearn's train_test_split, and the results saved into at least 2 CSVs (3 if an accuracy measure is desired):

  • train.csv - The training data, including the class labels.
  • X_test.csv - The input values of the test data. Does not include the class labels.
  • y_test.csv - If applicable, the class labels for the samples in X_test.csv. These must be in the same order as X_test.csv.

A model can then be trained on the training data using:

python bin/lucy.egg plugins ml_learn -input ./train.csv -output dtree.json -mfile dtree.pkl -classify dtree -clabel class

The resulting PKL file can then be used to make predictions on data with the same input columns:

python bin/lucy.egg plugins ml_predict -input ./X_test.csv -output ./pred.csv -mfile ./dtree.pkl

The results in pred.csv can then be compared to the labels stored in y_test.csv.
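The manual split can also be sketched without scikit-learn, using only the standard library. The file names match the list above; the 80/20 ratio and the tiny inline dataset are arbitrary choices for illustration:

```python
# Sketch: split a labeled dataset into train.csv, X_test.csv, and y_test.csv.
import csv
import random

header = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
rows = [
    [5.1, 3.5, 1.4, 0.2, 0],
    [7.0, 3.2, 4.27, 1.4, 1],
    [6.3, 3.3, 6.0, 2.5, 2],
    [4.9, 3.0, 1.4, 0.2, 0],
    [6.4, 3.2, 4.5, 1.5, 1],
]

random.seed(0)                 # reproducible shuffle for this illustration
random.shuffle(rows)
cut = int(len(rows) * 0.8)     # 80% train / 20% test
train, test = rows[:cut], rows[cut:]

with open("train.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(header)         # training data keeps the class labels
    w.writerows(train)

with open("X_test.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(header[:-1])    # inputs only, no class label
    w.writerows(r[:-1] for r in test)

with open("y_test.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow([header[-1]])   # labels in the same row order as X_test.csv
    w.writerows([r[-1]] for r in test)
```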

Random Forest

WIP

Regression

Supervised learning methods used to train models based on numerical target values.

Linear Regression

python bin/lucy.egg plugins ml_learn -input ./input.csv -output ./out.json -regression linear -clabel target

Lasso Regression

python bin/lucy.egg plugins ml_learn -input ./input.csv -output ./out.json -regression lasso -clabel target

Ridge Regression

python bin/lucy.egg plugins ml_learn -input ./input.csv -output ./out.json -regression ridge -clabel target

Support Vector Regression

python bin/lucy.egg plugins ml_learn -input ./input.csv -output ./out.json -regression svr -clabel target

Decision Tree Regression

python bin/lucy.egg plugins ml_learn -input ./input.csv -output ./out.json -regression dtreereg -clabel target

Random Forest Regression

python bin/lucy.egg plugins ml_learn -input ./input.csv -output ./out.json -regression rfr -clabel target

Gaussian Process Regression

python bin/lucy.egg plugins ml_learn -input ./input.csv -output ./out.json -regression gpr -clabel target

Logistic Regression

WIP. Note: logistic regression is a classification algorithm; this entry should be moved to the Classification section.