Learn
=====

The ML Learn plugin trains a machine learning model. To list the available
arguments, run:

``python bin/lucy.egg plugins ml_learn -h``

Primary Arguments
^^^^^^^^^^^^^^^^^
Main specifications for a Learn task.

.. list-table::
   :widths: 10 40
   :header-rows: 1

   * - Argument
     - Detail

   * - -input
     - **PATH**. File path to the input data.

   * - -output
     - **PATH**. File path to where the output summary file will be saved. It is
       recommended that this file be saved as JSON.

   * - -mfile
     - **PATH**. File path to where the resulting pickled model file will be
       saved. This is a serialized pickle (.pkl) file.

   * - -clabel
     - **STR**. Used with '-classify' and '-regression'. The name of the
       target/response/output column.

Preprocessing
^^^^^^^^^^^^^
Manipulation of the data before using it.

.. list-table::
   :widths: 10 40
   :header-rows: 1

   * - Argument
     - Detail

   * - -autoclean
     - **NONE**. Cleans the data using a predefined set of rules.

   * - -independents
     - **COMMA SEPARATED LIST**. A subset of column names that will be used for
       learning. The '-clabel' column, if specified, is included automatically.

   * - -normalize
     - **METHOD**. ('standard', 'minmax') Normalizes the columns of ALL
       numerical data using the selected scaling method.

   * - -degree
     - **INT**. (1 - 5) Replaces the input data with polynomial combinations of
       the original input columns. This transformation can be used when the
       input data is numerical.

   * - -cv
     - **METHOD**. ('yes', 'no') Turns cross-validation on or off. Currently,
       'yes' enables K-fold cross-validation.

   * - -cval
     - **INT**. Used with '-cv' when K-fold cross-validation has been selected.
       The value given is 'K', the number of folds.
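
As a rough illustration of the options above, the sketch below shows what
'minmax' and 'standard' scaling, polynomial combinations, and K-fold splitting
conventionally compute. It is plain Python written for this page, not the
plugin's implementation. The helper names are hypothetical, and details such as
whether a constant term is generated or whether rows are shuffled before folding
are assumptions.

.. code-block:: python

   from itertools import combinations_with_replacement
   from statistics import mean, pstdev


   def minmax_scale(column):
       """Rescale values linearly so the smallest becomes 0 and the largest 1."""
       lo, hi = min(column), max(column)
       # A constant column (hi == lo) would divide by zero; not handled here.
       return [(v - lo) / (hi - lo) for v in column]


   def standard_scale(column):
       """Shift to zero mean and divide by the population standard deviation."""
       mu, sigma = mean(column), pstdev(column)
       return [(v - mu) / sigma for v in column]


   def polynomial_terms(names, degree):
       """List every product of input columns from degree 1 up to `degree`."""
       terms = []
       for d in range(1, degree + 1):
           terms.extend(combinations_with_replacement(names, d))
       return terms


   def kfold_indices(n_rows, k):
       """Yield (train, test) row-index lists for k contiguous, unshuffled folds."""
       sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
       start = 0
       for size in sizes:
           test = list(range(start, start + size))
           train = [i for i in range(n_rows) if i < start or i >= start + size]
           yield train, test
           start += size


   heights = [150.0, 160.0, 170.0, 180.0]  # made-up numerical column
   print(minmax_scale(heights))
   print(standard_scale(heights))
   print(polynomial_terms(["x1", "x2"], 2))
   for train, test in kfold_indices(6, 3):
       print(train, test)

With two columns and a degree of 2, the sketch lists five terms (``x1``, ``x2``,
``x1*x1``, ``x1*x2``, ``x2*x2``), which shows how quickly the number of inputs
grows as the degree increases.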

Cluster
^^^^^^^
Unsupervised learning methods used to train models that put observations into
similar categories.

.. list-table::
   :widths: 10 40
   :header-rows: 1

   * - Argument
     - Detail

   * - -cluster
     - **ALGO**. ('kmeans', 'meanshift') The unsupervised learning algorithm
       used to group similar data points.

   * - -nc
     - **INT**. Used with '-cluster kmeans'. The number of clusters that the
       observations will be put into based on found similarity. The default
       value is 3.

   * - -bandwidth
     - **FLOAT**. Used with '-cluster meanshift'. Reflects the distribution of
       the observations. Choose the bandwidth with the range(s) of the dataset
       in mind. For example, a dataset that has been normalized using a minmax
       scaler will have numerical values between 0 and 1, so the bandwidth
       should also be between 0 and 1.
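
To illustrate why the bandwidth has to match the scale of the data, the sketch
below runs a simplified flat-kernel mean shift on made-up minmax-scaled points.
It is an assumption-laden teaching example in plain Python (``mean_shift`` is a
hypothetical helper), not the algorithm implementation the plugin uses.

.. code-block:: python

   from math import dist


   def mean_shift(points, bandwidth, n_iter=50):
       """Simplified flat-kernel mean shift; returns the distinct modes found."""
       modes = [tuple(p) for p in points]
       for _ in range(n_iter):
           shifted = []
           for m in modes:
               # Move each mode to the average of all points within one bandwidth.
               near = [p for p in points if dist(p, m) <= bandwidth]
               shifted.append(tuple(sum(c) / len(near) for c in zip(*near)))
           modes = shifted
       # Modes that ended up within one bandwidth of each other form one cluster.
       centers = []
       for m in modes:
           if all(dist(m, c) > bandwidth for c in centers):
               centers.append(m)
       return centers


   # Hypothetical minmax-scaled data: two tight groups near opposite corners.
   scaled = [(0.00, 0.05), (0.05, 0.00), (0.10, 0.10),
             (0.90, 0.95), (0.95, 1.00), (1.00, 0.90)]

   print(len(mean_shift(scaled, bandwidth=0.3)))  # bandwidth inside the data range
   print(len(mean_shift(scaled, bandwidth=5.0)))  # bandwidth far larger than the range

With a bandwidth of 0.3 the two groups stay separate, while a bandwidth of 5.0
covers every point at once and merges them into a single cluster.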

Classify
^^^^^^^^
Supervised learning methods used to train models based on categorical target
values.

.. list-table::
   :widths: 10 40
   :header-rows: 1

   * - Argument
     - Detail

   * - -classify
     - **ALGO**. ('dtree', 'rfc', 'gnb', 'logistic') Learns a classification
       model using the algorithm given.

   * - -cval
     - **FLOAT**. (0 - 100) Used with '-classify dtree', '-classify rfc', or
       '-classify gnb'. Percentage of the data to be used for cross-validation.
       The default value is 1.

   * - -criterion
     - **CRITERION**. ('gini', 'entropy') Used with '-classify dtree' and
       '-classify rfc'. The function used to measure the quality of a split.
       The default is 'gini'.

   * - -nestimators
     - **INT**. Used with '-classify rfc'. The number of trees in the forest. The
       default value is 10.
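
For reference, the two criterion names correspond to standard impurity measures.
The sketch below computes their textbook definitions in plain Python; it assumes
the plugin's 'gini' and 'entropy' options follow these conventional definitions,
and the function names are illustrative only.

.. code-block:: python

   from collections import Counter
   from math import log2


   def gini(labels):
       """Gini impurity: 1 minus the sum of squared class proportions."""
       n = len(labels)
       return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


   def entropy(labels):
       """Shannon entropy in bits: -sum(p * log2(p)) over class proportions."""
       n = len(labels)
       return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


   mixed = ["yes", "yes", "no", "no"]   # evenly mixed node
   pure = ["yes", "yes", "yes", "yes"]  # node holding a single class

   print(gini(mixed), entropy(mixed))
   print(gini(pure), entropy(pure))

An evenly mixed two-class node scores 0.5 under Gini and 1 bit under entropy,
while a pure node scores 0 under both. A split is better the more it lowers
these values in the resulting child nodes.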

Regression
^^^^^^^^^^
Supervised learning methods used to train models based on numerical target
values.

.. list-table::
   :widths: 10 40
   :header-rows: 1

   * - Argument
     - Detail

   * - -regression
     - **ALGO**. ('linear', 'lasso', 'ridge', 'svr', 'rfr', 'dtreereg', 'gpr')
       Learns a regression model using the algorithm given.

   * - -grid-response
     - **NONE**. Used with '-regression'. Works when there are 1 or 2 input
       features. Together with the predicted response, this yields a list of
       2D or 3D points. The input prediction points are arranged based on the
       range(s) of the input values, so the resulting points can be used to
       construct the regression line or surface. Currently provides 10^p
       points, where 'p' is the number of input parameters.

   * - -optimize
     - **METHOD**. ('min', 'max') Uses the model to find the input values that
       minimize ('min') or maximize ('max') the predicted value. Used with
       '-regression linear', '-regression lasso', or '-regression ridge'.
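
The 10^p point count described for '-grid-response' is consistent with taking 10
values along each input axis and combining them. The sketch below shows one way
such a grid could be built. Even spacing and the helper names (``linspace``,
``grid_points``) are assumptions made for illustration, not the plugin's
documented behaviour.

.. code-block:: python

   from itertools import product


   def linspace(lo, hi, n):
       """Return n evenly spaced values from lo to hi inclusive."""
       step = (hi - lo) / (n - 1)
       return [lo + i * step for i in range(n)]


   def grid_points(ranges, per_axis=10):
       """Cartesian product of evenly spaced values on each input axis."""
       axes = [linspace(lo, hi, per_axis) for lo, hi in ranges]
       return list(product(*axes))


   # Hypothetical input ranges taken from the training data.
   one_input = grid_points([(0.0, 9.0)])
   two_inputs = grid_points([(0.0, 9.0), (-1.0, 1.0)])

   print(len(one_input), len(two_inputs))  # 10 100

Predicting the response at each grid point then gives the 2D points of a
regression line (one input) or the 3D points of a surface (two inputs).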

.. include:: learn_examples.rst