# 4.3. Learn¶

The ML Learn plugin allows a machine learning model to be trained.

`python bin/lucy.egg plugins ml_learn -h`

## Primary Arguments¶

Main specifications for a Learn task.

Argument | Detail |
---|---|
-input | PATH. File path to the input data. |
-output | PATH. File path to where the output summary file will be saved. It is recommended that this file be saved as a JSON. |
-mfile | PATH. File path to where the resulting pickled model file will be saved. This is a serialized pickle (.pkl) file. |
-clabel | STR. Used with ‘-classify’ and ‘-regression’. The name of the target/response/output column. |

## Preprocessing¶

Manipulation of the data before using it.

Argument | Detail |
---|---|
-autoclean | NONE. Cleans the data based on a pre-defined set of rules. |
-independents | COMMA SEPARATED LIST. A subset of column names that will be used for Learning. The subset will automatically include a ‘-clabel’ if specified. |
-normalize | METHOD. (‘standard’, ‘minmax’) Normalizes the columns of ALL numerical data using the selected scaling method. |
-degree | INT. (1 - 5) Generates new input data as polynomial combinations of the original input dataset. This transformation can be used when the input data is numerical. |
-cv | METHOD. (‘yes’, ‘no’) Turns on cross-validation. Currently, option ‘yes’ enables K-fold cross-validation. |
-cval | INT. Used with ‘-cv’ when K-fold cross-validation has been selected. The value given is the number of folds, ‘K’. |
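The ‘-cv yes -cval K’ options above presumably correspond to K-fold splitting. A minimal sketch of the idea with scikit-learn (an assumption; the synthetic data and the linear model here are illustrative, not part of Lucy):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic numeric data standing in for the '-input' CSV.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# '-cv yes -cval 5' would mean 5-fold cross-validation: the data is
# split into 5 folds, and each fold is held out once for scoring.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kfold)

print(len(scores))  # one score per fold -> 5
```

Each entry of `scores` is the held-out R² for one fold; averaging them gives a less optimistic estimate than scoring on the training data.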

## Cluster¶

Unsupervised learning methods used to train models that put observations into similar categories.

Argument | Detail |
---|---|
-cluster | ALGO. (‘kmeans’, ‘meanshift’) Unsupervised learning algorithms that group similar data points. |
-nc | INT. Used with ‘-cluster kmeans’. The number of clusters that the observations will be put into based on found similarity. The default value is 3. |
-bandwidth | FLOAT. Used with ‘-cluster meanshift’. Reflects the distribution of the observations. The bandwidth must be chosen with the range(s) of the dataset in mind. A dataset that has been normalized using a minmax scaler will have numerical values between 0 and 1, so the bandwidth should also be between 0 and 1. |

## Classify¶

Supervised learning methods used to train models based on categorical target values.

Argument | Detail |
---|---|
-classify | ALGO. (‘dtree’, ‘rfc’, ‘gnb’, ‘logistic’) Learns a classification model using the algorithm given. |
-cval | FLOAT. (0 - 100) Used with ‘-classify dtree’, ‘-classify rfc’, or ‘-classify gnb’. Percentage of the data to be used for cross-validation. The default value is 1. |
-criterion | CRITERION. (‘gini’, ‘entropy’) Used with ‘-classify dtree’ and ‘-classify rfc’. The method to measure the quality of a split. The default is ‘gini’. |
-nestimators | INT. Used with ‘-classify rfc’. The number of trees in the forest. The default value is 10. |
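The ‘-nestimators’ and ‘-criterion’ options map naturally onto scikit-learn's `RandomForestClassifier`; that the plugin wraps scikit-learn is an assumption, but the defaults above match it. A sketch using the iris data for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# '-classify rfc -nestimators 10 -criterion gini' would correspond to:
model = RandomForestClassifier(
    n_estimators=10, criterion="gini", random_state=0).fit(X, y)

print(len(model.estimators_))  # the forest holds 10 individual trees
```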

## Regression¶

Supervised learning methods used to train models based on numerical target values.

Argument | Detail |
---|---|
-regression | ALGO. (‘linear’, ‘lasso’, ‘ridge’, ‘svr’, ‘rfr’, ‘dtreereg’, ‘gpr’) Learns a regression model using the algorithm given. |
-grid-response | NONE. Used with ‘-regression’. Works for input feature space sizes of 1 or 2. Together with the response, this results in a list of points that are 2D or 3D. Input prediction points are arranged based on the range(s) of the input values. The idea is that ‘-grid-response’ generates points that can be used to construct the regression line or surface. Currently provides 10^p points, where ‘p’ is the number of input parameters. |
-optimize | METHOD. (‘min’, ‘max’) Uses the model to determine the inputs required to obtain a prediction value that meets the selected optimize method. Used with ‘-regression linear/lasso/ridge’. |
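The point count described for ‘-grid-response’ (10 points per input dimension, 10^p in total, arranged over the range of each input) can be sketched with NumPy. The `grid_response` helper and the toy predictor below are illustrative names, not part of Lucy:

```python
import numpy as np

def grid_response(X, predict, points_per_dim=10):
    """Build a prediction grid spanning the range of each input column.

    For p input columns this yields points_per_dim ** p grid points,
    each paired with the model's predicted response.
    """
    axes = [np.linspace(col.min(), col.max(), points_per_dim) for col in X.T]
    mesh = np.meshgrid(*axes)
    grid = np.stack([m.ravel() for m in mesh], axis=1)
    return np.column_stack([grid, predict(grid)])

# Two input features -> 10**2 = 100 (x1, x2, response) points,
# i.e. a surface in 3-D.
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])
points = grid_response(X, predict=lambda g: g.sum(axis=1))
print(points.shape)  # (100, 3)
```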

## Learn Examples¶

### Input¶

A Learn routine begins by specifying the file path from which the input data will be read.

`python bin/lucy.egg plugins ml_learn -input ./input_data.csv`

### Output¶

A Learn routine ends by specifying where the output data is to be saved. This file is in a JSON format and is packed with information that is useful to the user, including the information necessary to reconstruct the learned model without requiring Python.

```
python bin/lucy.egg plugins ml_learn -input ./input_data.csv -output
./out.json
```

### Preprocessing¶

#### Dimension Reduction¶

Dimensionality Reduction is the process of transforming data to a lower dimensional space. Ideally, this lower dimensional space will retain many of the same properties as the original data set.

**Feature Importance**

Lucy uses two metrics for measuring Feature Importance given only a Target column. Currently, these metrics can only be obtained when running a Lucy Learn routine.

- Correlation - The correlation of each numeric input variable and a numeric Target
- Gini Importance - The importance measure of each input variable for a given target.

To obtain these, the option `-feature-importance` is used with no argument required. These values are populated when specifying a `-clabel` and are saved in the Lucy-Response JSON when specifying a path for the Lucy-Response using `-output`.

Consider the iris dataset and the following command:

`python bin/lucy.egg plugins ml_learn -input iris.csv -output iris_output.json -clabel class -classify dtree -feature-importance`

The Lucy-Response is populated with the following scores:

```
{
  ...
  "payload": {
    ...
    "feature_importance": {
      "gini_importance": {
        "sepal_width": 0.006145984326963746,
        "petal_width": 0.4852350158451978,
        "sepal_length": 0.0059189255008685404,
        "petal_length": 0.50270007432697
      }
    }
  }
}
```
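Comparable scores can be computed directly with scikit-learn and NumPy; a sketch on the iris data (the exact numbers depend on how the tree is trained, so they will not match the scores above exactly):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X, y = data.data, data.target

# Gini importance: per-feature importances from a fitted decision tree;
# they sum to 1 across features.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
gini = dict(zip(data.feature_names, tree.feature_importances_))

# Correlation: each numeric input column against the numeric target.
corr = {name: float(np.corrcoef(X[:, i], y)[0, 1])
        for i, name in enumerate(data.feature_names)}
```

As in the Lucy-Response, the petal measurements dominate for iris: they correlate strongly with the class and carry most of the tree's Gini importance.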

**Dimensional Reduction Suggestions**

`-dim-reduct`

**Principal Component Analysis**

PCA is a linear dimensionality reduction technique that uses Singular Value Decomposition to project the original data to a lower dimensional space. Lucy keeps the principal components that explain 99% of the variance of the original data. These components are then used to transform the original data to a space of a lower dimension. The resulting columns do not share any direct relation to the features of the original data.
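Keeping the components that explain 99% of the variance matches scikit-learn's fractional `n_components` setting; a sketch of what `-pca` presumably does (an assumption, using the iris data for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# A fractional n_components keeps the smallest number of principal
# components whose explained variance reaches 99%.
pca = PCA(n_components=0.99)
pca_dataset = pca.fit_transform(X)

# The new columns are linear combinations of the originals, so they
# carry no direct relation to the original feature names.
print(pca_dataset.shape)  # (150, 3): one column fewer than the input
```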

`python bin/lucy.egg plugins ml_learn -input ./input.csv -output ./out.json -regression linear -clabel target -pca`

The resulting PCA dataset is saved under ‘pca_dataset’ in the output file. It can be used in place of the original input: the PCA dataset would need to be saved as a CSV, and a new instance of Learn initiated where the input data is the pca_dataset along with the copied-over target column.

Note

Currently, these dimensional reduction insight options can only be used when running a successful Learn process. The dimension reduction results are not applied automatically; they are options for the user to consider.

#### Independents¶

`-independents`

#### Normalize¶

`-normalize`

#### Autoclean¶

Autoclean is a way to clean a dataset before performing the Learn task. Details of the exact procedures performed by Autoclean can be found in Clean’s Autoclean section.

The argument `-autoclean` is used in order to clean the dataset before the Learn task is applied.

```
python bin/lucy.egg plugins ml_learn -input ./input_data.csv -output
./out.json -regression linear -clabel target -autoclean
```

### Clustering¶

Clustering is an unsupervised learning technique that discovers natural groupings in a dataset.

#### K-Means¶

The K-Means clustering algorithm groups samples into a user-defined number *p* of groups. The group a sample belongs to is based on which centroid it is closest to. There are *p* centroids, whose locations are found through iteration of the K-Means algorithm. The algorithm scales well with a large number of samples.
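A sketch of the same procedure with scikit-learn's `KMeans` (that the plugin wraps scikit-learn is an assumption; the two petal columns are chosen only for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
petal = X[:, 2:4]  # petal length and petal width columns

# p = 3 centroids, refined iteratively from an initial placement;
# each sample is assigned to its nearest centroid.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(petal)

print(kmeans.cluster_centers_.shape)  # (3, 2): three centroids in 2-D
```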

Consider the Iris Dataset containing 3 different classes:

sepal_length | sepal_width | petal_length | petal_width | class |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | 0 |
7 | 3.2 | 4.27 | 1.4 | 1 |
6.3 | 3.3 | 6 | 2.5 | 2 |
… | … | … | … | … |

Plotting the petal length vs. the petal width shows a fairly clear separation among the three classes.

Now let’s use K-Means clustering to visually compare these results. It is recommended that the number of clusters be fit to the problem in mind. This can be specified using `-nc`.

```
python bin/lucy.egg plugins ml_learn -input ./input_data.csv -output
./out.json -independents petal_length,petal_width -cluster kmeans -nc 3
```

The resulting group assignments are saved in the Lucy-Response JSON file saved at the `-output` path. The found labels are saved as a dictionary where the key is the id (or the order) in which the input points were provided to the Learn routine. The number of unique values these points are assigned to is equal to the `-nc` provided. The location where these can be found in the Lucy-Response is shown below:

```
{
  ...
  "payload": {
    ...
    "model_dict": {
      "labels_": {
        "139": 0,
        "138": 2,
        "24": 1,
        ...
      }
    }
  }
}
```

‘payload’ -> ‘model_dict’ -> ‘labels_’.

To visualize the results, the samples can be assigned the groups they have been given from the K-Means clustering in the same order the input data was given. The results can be seen below.

Additionally, the location of the centroids can be found in the output JSON at ‘payload’ -> ‘model_dict’ -> ‘cluster_centers_’. For this example the centroids were found to be

```
cluster_centers_ = [
    [1.4639999999999995, 0.24400000000000022],
    [4.269230769230769, 1.3423076923076924],
    [5.595833333333332, 2.0374999999999996]
]
```

Note

There are no labels associated with the result. This is an unsupervised learning algorithm. The samples are grouped but the groups are not assigned any labels since there are no labels to learn from.

#### Mean Shift¶

MeanShift clustering aims to discover blobs in a smooth density of samples. It is a centroid based algorithm that works by updating candidates for centroids to be the mean of the points within a given region called the bandwidth. These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids.
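A sketch with scikit-learn's `MeanShift` (that the plugin wraps scikit-learn is an assumption; standardizing the two petal columns and the 0.5 bandwidth are illustrative choices):

```python
from sklearn.cluster import MeanShift
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
petal = StandardScaler().fit_transform(X[:, 2:4])  # petal length, width

# Candidate centroids are shifted to the mean of the points within the
# bandwidth, then near-duplicates are merged in a post-processing stage.
ms = MeanShift(bandwidth=0.5).fit(petal)

print(ms.cluster_centers_.shape[1])  # 2 coordinates per centroid
```

Unlike K-Means, the number of clusters is not specified up front; it falls out of the bandwidth.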

Consider the Iris Dataset containing 3 different classes:

sepal_length | sepal_width | petal_length | petal_width | class |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | 0 |
7 | 3.2 | 4.27 | 1.4 | 1 |
6.3 | 3.3 | 6 | 2.5 | 2 |
… | … | … | … | … |

Plotting the petal length vs. the petal width shows a fairly clear separation among the three classes.

Now let’s use Mean Shift clustering to visually compare its results with the true category separation. A bandwidth should be specified. It is recommended that the data be normalized before performing the algorithm. This will narrow the required bandwidth to be between 0 and 1.

```
python bin/lucy.egg plugins ml_learn -input ./input_data.csv -output
./out.json -independents petal_length,petal_width -cluster meanshift
-bandwidth 0.5 -normalize standard
```

The resulting group assignments can be found in the output JSON at the location ‘payload’ -> ‘model_dict’ -> ‘labels_’. To visualize the results, the samples can be assigned the groups they have been given from the Mean Shift clustering in the same order the input data was given. The results can be seen below.

Additionally, the location of the centroids can be found in the output JSON at ‘payload’ -> ‘model_dict’ -> ‘cluster_centers_’. For this example the centroids are reflective of the normalized data and are found to be

```
cluster_centers_ = [
    [-1.3005214861029293, -1.2509378621062448],
    [0.37427641672771905, 0.24821582279821572],
    [0.8873031279121748, 1.0261947088710563]
]
```

Note

There ended up being 3 total clusters, but this won’t always be the case. The number of clusters depends on the bandwidth: as the bandwidth increases, the number of clusters decreases.

### Classification¶

Supervised learning methods used to train models based on categorical target values.

**WIP**

#### Decision Tree Classifier¶

The trained model is a single Decision Tree whose target column is categorical.

Consider the Iris Dataset containing 3 different classes:

sepal_length | sepal_width | petal_length | petal_width | class |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | 0 |
7 | 3.2 | 4.27 | 1.4 | 1 |
6.3 | 3.3 | 6 | 2.5 | 2 |
… | … | … | … | … |

There are a few requirements when performing classification using a Decision Tree. We need `-input` data, to specify a `-clabel`, and to use `-classify dtree`. Additionally, we will need to specify `-mfile` if we want to save the resulting model for later use in PREDICT. If we want some readable information about the model, we can specify `-output` to save an informative Lucy-Response in a JSON format.

Putting everything together we have the command:

```
python bin/lucy.egg plugins ml_learn -input iris.csv -output
dtree_LucyResponse.json -mfile iris_dtree.pkl -classify dtree -clabel class
```
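A sketch of what a command like this presumably does internally, assuming scikit-learn and the standard `pickle` module (the training and serialization below are illustrative, not Lucy's actual code path):

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# '-classify dtree' with '-clabel class': fit a single decision tree
# to the categorical target column.
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# '-mfile iris_dtree.pkl': serialize the trained model as a .pkl
# payload that a later PREDICT run could load and reuse.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

print(restored.score(X, y) == model.score(X, y))  # True
```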

The structure of the Tree is saved at:

```
{
  ...
  "payload": {
    ...
    "model_dict": {
      ...
      "tree_data": {
        ...
      }
    }
  }
}
```
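To evaluate a classifier like this one on unseen data, a train-test split must currently be prepared outside Lucy, e.g. with scikit-learn's `train_test_split`; a minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows for evaluation; the two pieces would then be
# saved as separate CSVs and passed to Learn and PREDICT respectively.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(len(X_test))  # 38 of the 150 iris rows are held out
```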

Note

Learn assumes that any desired train-test split has already been applied to the data. Perhaps this will be a feature in a future version, but for now it will need to be done manually or with Sklearn’s `train_test_split`.

#### Random Forest¶

**WIP**

### Regression¶

Supervised learning methods used to train models based on numerical target values.

#### Linear Regression¶

```
python bin/lucy.egg plugins ml_learn -input ./input.csv -output ./out.json
-regression linear -clabel target
```
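What `-regression linear` fits is presumably an ordinary least-squares model; a sketch with scikit-learn on toy data (an assumption; the data and coefficients below are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-in for input.csv where target = 3*x + 1 exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 * X[:, 0] + 1.0

# Least squares recovers the slope and intercept.
model = LinearRegression().fit(X, y)
print(round(float(model.coef_[0]), 6),
      round(float(model.intercept_), 6))  # 3.0 1.0
```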

#### Lasso Regression¶

```
python bin/lucy.egg plugins ml_learn -input ./input.csv -output ./out.json
-regression lasso -clabel target
```

#### Ridge Regression¶

```
python bin/lucy.egg plugins ml_learn -input ./input.csv -output ./out.json
-regression ridge -clabel target
```

#### Support Vector Regression¶

```
python bin/lucy.egg plugins ml_learn -input ./input.csv -output ./out.json
-regression svr -clabel target
```

#### Decision Tree Regression¶

```
python bin/lucy.egg plugins ml_learn -input ./input.csv -output ./out.json
-regression dtreereg -clabel target
```

#### Random Forest Regression¶

```
python bin/lucy.egg plugins ml_learn -input ./input.csv -output ./out.json
-regression rfr -clabel target
```

#### Gaussian Process Regression¶

```
python bin/lucy.egg plugins ml_learn -input ./input.csv -output ./out.json
-regression gpr -clabel target
```

#### Logistic Regression¶

**Should be moved to Classification; logistic regression is a classification algorithm, available via ‘-classify logistic’.**