Clean
=====

The ML Clean Plugin has various options for cleaning a dataset.

``python bin/lucy.egg plugins ml_clean -h``

Primary Arguments
^^^^^^^^^^^^^^^^^

Main specifications for a Clean task.

.. list-table::
   :widths: 10 40
   :header-rows: 1

   * - Argument
     - Detail
   * - -input
     - **PATH**. File path to the input file.
   * - -output
     - **PATH**. File path to where the cleaned data will be saved.
       It is recommended that this file be saved as a CSV, since CSV is
       the format currently supported.
   * - -order
     - **COMMA SEPARATED LIST**. The order in which the tasks will be
       carried out. If an order is given, only the tasks listed in it
       will be used when cleaning the data. The default order of tasks
       is: keep, -drop-column, -bin-column, -filter, -fill-empty,
       -fill-type, -drop-empty, normalize.

Choose Columns
^^^^^^^^^^^^^^

Define a subset of columns to use.

.. list-table::
   :widths: 10 40
   :header-rows: 1

   * - Argument
     - Detail
   * - -keep
     - **COMMA SEPARATED LIST**. The names of the columns, as they
       appear in the input file, that will be kept. The rest will be
       deleted. This is the inverse of '-drop-column'. It is recommended
       that either '-drop-column' or '-keep' be used, but never both.
   * - -drop-column
     - **COMMA SEPARATED LIST**. The names of the columns, as they
       appear in the input file, that will be deleted. The rest will be
       kept. This is the inverse of '-keep'. It is recommended that
       either '-drop-column' or '-keep' be used, but never both.

Empty Values
^^^^^^^^^^^^

Methods for dealing with empty values in the dataset.

.. list-table::
   :widths: 10 40
   :header-rows: 1

   * - Argument
     - Detail
   * - -drop-empty
     - **NONE**. Deletes all rows from the dataset that contain empty
       values.
   * - -fill-empty
     - **COMMA SEPARATED LIST**. The names of the columns, as they
       appear in the input file, that will have their empty values
       populated. If no '-fill-type' is specified, the empty values
       will be populated with their column's mode.
   * - -fill-type
     - **INT**. ('0': mean, '1': median, '2': mode,
       '3': user_specified) Defines what will be populated in the empty
       cells. This must be used with '-fill-empty'.
   * - -fill-value
     - **VALUE**. (String, Numerical) A user-defined value that will
       populate the empty cells. This must be used with '-fill-empty'
       and '-fill-type 3'.

Bin Column
^^^^^^^^^^

Bins column values.

.. list-table::
   :widths: 10 40
   :header-rows: 1

   * - Argument
     - Detail
   * - -bin-column
     - **STRING**. The name of the column, as it appears in the input
       data, that will undergo binning.
   * - -nbins
     - **INT**. The number of bins to split the column into. The
       default value is 4.
   * - -bin-values
     - **COMMA SEPARATED LIST**. A list of numerical values that
       defines how the bins will be split. The number of values
       provided must be one greater than '-nbins'.
   * - -bin-replace
     - **NONE**. When given, the original column that has been binned
       using '-bin-column' is replaced.

Other Preprocessing
^^^^^^^^^^^^^^^^^^^

Other preprocessing options to apply to the entire dataset.

.. list-table::
   :widths: 10 40
   :header-rows: 1

   * - Argument
     - Detail
   * - -normalize
     - **METHOD**. ('standard', 'minmax') Normalizes ALL numerical
       columns using the selected scaling method.

.. include:: clean_examples.rst
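
The options above can be pictured with a small, standard-library-only sketch. It is not the plugin's implementation: the helper names ``fill_empty``, ``bin_column`` and ``normalize`` are hypothetical, and several behaviours are assumptions made for illustration (empty cells are represented as ``None``, values outside the bin edges are clamped to the first or last bin, and 'standard' scaling uses the population standard deviation). It only shows how the '-fill-type' codes, the "one more edge than bins" rule for '-bin-values', and the two '-normalize' methods could work.

```python
import statistics
from bisect import bisect_right

# Hypothetical helpers for illustration only; they are not part of the plugin.


def fill_empty(values, fill_type=2, fill_value=None):
    """Fill None entries using the -fill-type codes.

    0: mean, 1: median, 2: mode (the documented default), 3: user value.
    """
    present = [v for v in values if v is not None]
    if fill_type == 0:
        filler = statistics.mean(present)
    elif fill_type == 1:
        filler = statistics.median(present)
    elif fill_type == 2:
        filler = statistics.mode(present)
    elif fill_type == 3:
        filler = fill_value
    else:
        raise ValueError("fill_type must be 0, 1, 2 or 3")
    return [filler if v is None else v for v in values]


def bin_column(values, edges):
    """Map each value to a bin index; len(edges) must be nbins + 1."""
    nbins = len(edges) - 1
    if nbins < 1:
        raise ValueError("at least two edges are required")
    binned = []
    for v in values:
        idx = bisect_right(edges, v) - 1
        # Assumption: out-of-range values are clamped to the outer bins,
        # so the top edge itself falls into the last bin.
        binned.append(min(max(idx, 0), nbins - 1))
    return binned


def normalize(values, method="standard"):
    """Scale a numeric column with 'standard' or 'minmax'."""
    if method == "minmax":
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]
    if method == "standard":
        mu = statistics.mean(values)
        # Assumption: population standard deviation (divide by n).
        sigma = statistics.pstdev(values)
        return [(v - mu) / sigma for v in values]
    raise ValueError("method must be 'standard' or 'minmax'")
```

In this sketch, a call such as ``bin_column(column, [0, 2.5, 5, 7.5, 10])`` corresponds to '-nbins 4' with five '-bin-values', and ``fill_empty(column, 3, "unknown")`` corresponds to '-fill-type 3' combined with '-fill-value unknown'.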