Learn
=====

The ML Learn Plugin trains a Machine Learning model.

``python bin/lucy.egg plugins ml_learn -h``

Primary Arguments
^^^^^^^^^^^^^^^^^

Main specifications for a Learn task.

.. list-table::
   :widths: 10 40
   :header-rows: 1

   * - Argument
     - Detail
   * - -input
     - **PATH**. File path to the input data.
   * - -output
     - **PATH**. File path where the output summary file will be saved. It is recommended that this file be saved as JSON.
   * - -mfile
     - **PATH**. File path where the resulting pickled model file will be saved. This is a serialized pickle (.pkl) file.
   * - -clabel
     - **STR**. Used with '-classify' and '-regression'. The name of the target/response/output column.

Preprocessing
^^^^^^^^^^^^^

Manipulation of the data before using it.

.. list-table::
   :widths: 10 40
   :header-rows: 1

   * - Argument
     - Detail
   * - -autoclean
     - **NONE**. Cleans the data based on a pre-defined set of rules.
   * - -independents
     - **COMMA SEPARATED LIST**. A subset of column names that will be used for Learning. The subset will automatically include the '-clabel' column if one is specified.
   * - -normalize
     - **METHOD**. ('standard', 'minmax') Normalizes ALL numerical columns using the selected scaling method.
   * - -degree
     - **INT**. (1 - 5) Generates new input data consisting of polynomial combinations of the original input dataset. This transformation can be used when the input data is numerical.
   * - -cv
     - **METHOD**. ('yes', 'no') Turns on cross-validation. Currently, option 'yes' enables K-fold cross-validation.
   * - -cval
     - **INT**. Used with '-cv' when K-fold cross-validation has been selected. The value given is the number of folds, 'K'.

Cluster
^^^^^^^

Unsupervised learning methods used to train models that put observations into similar categories.

.. list-table::
   :widths: 10 40
   :header-rows: 1

   * - Argument
     - Detail
   * - -cluster
     - **ALGO**. ('kmeans', 'meanshift') Unsupervised learning algorithm that groups similar data points.
   * - -nc
     - **INT**. Used with '-cluster kmeans'. The number of clusters that the observations will be put into based on found similarity. The default value is 3.
   * - -bandwidth
     - **FLOAT**. Used with '-cluster meanshift'. Reflects the distribution of the observations. Select the bandwidth with the range(s) of the dataset in mind. For example, a dataset that has been normalized using the minmax scaler has numerical values between 0 and 1, so the bandwidth should also be between 0 and 1.

Classify
^^^^^^^^

Supervised learning methods used to train models based on categorical target values.

.. list-table::
   :widths: 10 40
   :header-rows: 1

   * - Argument
     - Detail
   * - -classify
     - **ALGO**. ('dtree', 'rfc', 'gnb', 'logistic') Learns a classification model using the algorithm given.
   * - -cval
     - **FLOAT**. (0 - 100) Used with '-classify dtree', '-classify rfc', or '-classify gnb'. Percentage of the data to be used for cross-validation. The default value is 1.
   * - -criterion
     - **CRITERION**. ('gini', 'entropy') Used with '-classify dtree' and '-classify rfc'. The method used to measure the quality of a split. The default is 'gini'.
   * - -nestimators
     - **INT**. Used with '-classify rfc'. The number of trees in the forest. The default value is 10.

Regression
^^^^^^^^^^

Supervised learning methods used to train models based on numerical target values.

.. list-table::
   :widths: 10 40
   :header-rows: 1

   * - Argument
     - Detail
   * - -regression
     - **ALGO**. ('linear', 'lasso', 'ridge', 'svr', 'rfr', 'dtreereg', 'gpr') Learns a regression model using the algorithm given.
   * - -grid-response
     - **NONE**. Used with '-regression'. Works for input feature spaces of size 1 or 2. Together with the response, this results in a list of 2D or 3D points. Input prediction points are arranged based on the range(s) of the input values. The idea is that '-grid-response' generates points that can be used to construct the regression line or surface. Currently provides 10^p points, where 'p' is the number of input parameters.
   * - -optimize
     - **METHOD**. ('min', 'max') Used with '-regression linear/lasso/ridge'. Uses the model to determine the inputs required to obtain a prediction value that meets the selected optimization method.

.. include:: learn_examples.rst
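
The 10^p point count described for '-grid-response' can be illustrated with a small standalone sketch. This is not the plugin's implementation: it assumes the grid uses 10 evenly spaced values per input between that input's minimum and maximum, and the names ``linspace``, ``grid_points``, ``x1`` and ``x2`` are hypothetical.

```python
from itertools import product


def linspace(lo, hi, n):
    """Return n evenly spaced values from lo to hi, inclusive."""
    if n == 1:
        return [lo]
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def grid_points(columns, per_axis=10):
    """Build a prediction grid spanning the observed range of each input.

    columns: one list of values per input feature (hypothetical layout).
    Returns per_axis ** len(columns) tuples of input values.
    Even spacing is an assumption made for this sketch.
    """
    axes = [linspace(min(col), max(col), per_axis) for col in columns]
    # The Cartesian product of the axes gives every grid combination.
    return list(product(*axes))


# Made-up input columns, used only to show the point counts.
x1 = [0.0, 2.5, 9.0]
x2 = [-1.0, 0.5, 1.0]

grid_1d = grid_points([x1])      # p = 1 -> 10 input points
grid_2d = grid_points([x1, x2])  # p = 2 -> 100 input points
print(len(grid_1d), len(grid_2d))
```

Pairing each grid point with the model's prediction at that point gives the 2D or 3D points from which a regression line or surface can be drawn.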