4.1. Clean

The ML Clean Plugin has various options for cleaning a dataset.

python bin/lucy.egg plugins ml_clean -h

Primary Arguments

Main specifications for a Clean task.

Argument Detail
-input PATH. File path to input file.
-output PATH. File path to where the cleaned data will be saved. It is recommended that this file be saved as a CSV, since that is the format currently supported.
-order COMMA SEPARATED LIST. The order in which the tasks will be carried out. If an order is given, only the tasks listed in it will be run when cleaning the data. The default order of tasks is: keep, drop_column, bin_column, filter, fill_empty, fill_type, drop_empty, normalize.

Choose Columns

Define a subset of columns to use.

Argument Detail
-keep COMMA SEPARATED LIST. The names of the columns as they appear in the input file that will be kept. The rest will be deleted. This is similar to ‘-drop-column’. It is recommended that either ‘-drop-column’ or ‘-keep’ be used, but never both.
-drop-column COMMA SEPARATED LIST. The names of the columns as they appear in the input file that will be deleted. The rest will be kept. This is similar to ‘-keep’. It is recommended that either ‘-drop-column’ or ‘-keep’ be used, but never both.
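
The relationship between the two options can be sketched in plain Python. This is only an illustration, not Lucy's implementation: the helper functions `keep` and `drop_column` and the row values are made up for the example. It shows that keeping one set of columns is equivalent to dropping the complementary set, which is why only one of the two options is needed at a time.

```python
# Illustrative only: one record with made-up values.
row = {"Age": 34, "Weight": 70, "Gender": "F", "Blood_Pressure": 120}

def keep(record, names):
    """Return a copy of the record containing only the named columns."""
    return {k: v for k, v in record.items() if k in names}

def drop_column(record, names):
    """Return a copy of the record without the named columns."""
    return {k: v for k, v in record.items() if k not in names}

kept = keep(row, ["Weight", "Blood_Pressure"])
dropped = drop_column(row, ["Age", "Gender"])

# Both selections leave the same two columns.
print(kept == dropped)  # True
```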

Empty Values

Methods for dealing with empty values in the dataset.

Argument Detail
-drop-empty NONE. Deletes all rows from the dataset that contain empty values.
-fill-empty COMMA SEPARATED LIST. The names of the columns as they appear in the input file that will have their empty values populated. If no ‘-fill-type’ is specified, the empty values will be populated with their column’s mode.
-fill-type INT. ( ‘0’:mean, ‘1’:median, ‘2’:mode, ‘3’:user_specified) Define what will be populated in the empty cells. This must be used with ‘-fill-empty’.
-fill-value VALUE. (String, Numerical) A user-defined value that will populate the empty cells. This must be used with ‘-fill-empty’ and ‘-fill-type 3’.

Bin Column

Bins Column Values.

Argument Detail
-bin-column STRING. The name of the column, as it appears in the input data, that will be binned.
-nbins INT. The number of bins to split the column into. The default value is 4.
-bin-values COMMA SEPARATED LIST. A list of numerical values that defines where the bins are split. The number of values provided must be one greater than ‘-nbins’.
-bin-replace NONE. When this option is used, the binned values replace the original column named in ‘-bin-column’.
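
The "one greater than ‘-nbins’" rule follows from treating the values as bin boundaries: n bins need n + 1 boundaries. The sketch below illustrates this in plain Python. It is not Lucy's implementation; the function `assign_bins`, the sample values, and the boundary rule (each bin includes its lower boundary, and the last bin also includes its upper boundary) are assumptions made for the example.

```python
from bisect import bisect_right

def assign_bins(values, edges):
    """Map each value to a bin index between 0 and len(edges) - 2.

    Assumed rule: bin i covers edges[i] <= v < edges[i + 1], and the
    final bin also includes the last edge. Lucy's rule may differ.
    """
    nbins = len(edges) - 1
    result = []
    for v in values:
        if v < edges[0] or v > edges[-1]:
            result.append(None)  # value lies outside the binning range
        else:
            result.append(min(bisect_right(edges, v) - 1, nbins - 1))
    return result

edges = [0, 25, 50, 75, 100]  # 5 boundaries define 4 bins
ages = [3, 25, 49, 75, 100]   # made-up sample values
bins = assign_bins(ages, edges)
print(bins)  # [0, 1, 1, 3, 3]
```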

Other Preprocessing

Other preprocessing options to apply to the entire dataset.

Argument Detail
-normalize METHOD. (‘standard’, ‘minmax’) Normalizes the columns of ALL numerical data using the selected scaling method.

Clean Examples

Input

A Clean routine begins by specifying the path of the file from which the input data will be read.

python bin/lucy.egg plugins ml_clean -input ./input_data.csv

The selected dataset can now be cleaned using any number of cleaning methods.

Output

Once a dataset has been cleaned into its desired form, it is ready to be saved. This is done by specifying the file path where the data is to be saved.

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv

Once Clean finishes, the resulting dataset is saved to the output path.

Task Order

A user can specify the order in which the cleaning tasks are performed. If an order is given, only the tasks listed in it will be run when cleaning the data.

Consider the input data:

x y
-2 4
-1 1
0  
1 1
2 4

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -drop-column y -drop-empty

Yields the result:

x
-2
-1
0
1
2

This is because the default order is keep, drop_column, bin_column, filter, fill_empty, fill_type, drop_empty, autoclean, normalize, schema. Of the two tasks used here, drop_column runs before drop_empty, so the ‘y’ column is dropped before any rows containing empty values are dropped, and no row is removed.

Now consider the order changed using the -order option:

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -drop-column y -drop-empty -order drop_empty,drop_column

The resulting cleaned data is:

x
-2
-1
1
2

This is because the row containing the empty y value is removed before the entire y column is removed.
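
The effect of task order can be reproduced with a small plain-Python sketch. This is not Lucy's implementation: the functions `drop_column` and `drop_empty` below are hypothetical stand-ins for the tasks of the same name, and empty cells are modeled as `None`.

```python
# Illustrative only: the example dataset, with None for the empty y value.
rows = [
    {"x": -2, "y": 4},
    {"x": -1, "y": 1},
    {"x": 0, "y": None},
    {"x": 1, "y": 1},
    {"x": 2, "y": 4},
]

def drop_column(data, name):
    """Return the rows with the named column removed."""
    return [{k: v for k, v in row.items() if k != name} for row in data]

def drop_empty(data):
    """Return only the rows that contain no empty (None) values."""
    return [row for row in data if all(v is not None for v in row.values())]

# Default order: drop_column runs before drop_empty, so no row is empty.
default_result = drop_empty(drop_column(rows, "y"))

# Custom order: drop_empty runs first and removes the row where x is 0.
custom_result = drop_column(drop_empty(rows), "y")

print([row["x"] for row in default_result])  # [-2, -1, 0, 1, 2]
print([row["x"] for row in custom_result])   # [-2, -1, 1, 2]
```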

Choose Columns

There are two ways to select a subset of columns from the original data. Either the desired column(s) can be specified through Keep Columns or the undesired column(s) can be omitted through Drop Columns.

Consider the dataset with the following feature names:

Age Weight Gender Blood_Pressure
       

Warning

It is recommended that column names do not contain spaces. This is especially important when specifying column names using the Lucy build through a command prompt.

Keep Columns

Keep Columns defines the subset by naming the columns that are to be kept.

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -keep Weight,Blood_Pressure

Note

There is no space after the comma in the list of column names.

The resulting dataset will be:

Weight Blood_Pressure
   

Drop Columns

Drop Columns allows for the reverse action of Keep Columns. Columns that are specified will be removed from the dataset.

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -drop-column Weight,Blood_Pressure

The resulting dataset will be:

Age Gender
   

Empty Values

There are several options for populating values in the dataframe that are null/NaN. The values can be set to the mean, median or mode of the column. If desired, a custom value can be used instead.

Consider the following dataset:

x y
1 10
1 10
  40
4  

Drop Empty

Drop Empty results in the deletion of all rows that have an empty value.

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -drop-empty

The resulting dataset is:

x y
1 10
1 10

Fill Empty

Fill Empty specifies which column(s) will have their empty values populated. The mode is used by default.

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -fill-empty x

The resulting dataset is:

x y
1 10
1 10
1 40
4  

Fill Type

Empty values can be populated using a few options:

  • 0: Mean
  • 1: Median
  • 2: Mode
  • 3: User Specified

Note

If fill type 3 is chosen, the value has to be specified with ‘-fill-value’. See Fill Value for an example.

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -fill-empty y -fill-type 0

The resulting dataset is:

x y
1 10
1 10
  40
4 20

Note

When the column contains categorical data and a fill type of ‘0’ or ‘1’ is used, the empty values will be filled with the mode instead.
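
The fill types 0 to 2 can be sketched with Python's standard `statistics` module. This is only an illustration under stated assumptions, not Lucy's implementation: the function `fill_column` is hypothetical, empty cells are modeled as `None`, and a column is treated as categorical when any of its present values is not a number.

```python
from statistics import mean, median, mode

def fill_column(values, fill_type=2):
    """Fill None entries using 0: mean, 1: median, 2: mode (the default)."""
    present = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in present)
    if fill_type in (0, 1) and numeric:
        fill = mean(present) if fill_type == 0 else median(present)
    else:
        # Mode is the default, and the fallback for categorical columns.
        fill = mode(present)
    return [fill if v is None else v for v in values]

x = [1, 1, None, 4]
y = [10, 10, 40, None]

filled_x = fill_column(x)                # default: mode of 1, 1, 4
filled_y = fill_column(y, fill_type=0)   # mean of 10, 10, 40
filled_cat = fill_column(["a", "b", "b", None], fill_type=0)  # falls back to mode

print(filled_x)
print(filled_y)
print(filled_cat)
```

The first two calls reproduce the Fill Empty and Fill Type results shown above: the empty x value becomes 1 and the empty y value becomes 20.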

Fill Value

Empty values can be populated using a user-defined value. To do this, ‘-fill-empty’ must list the columns to fill and a fill type of ‘3’ must be selected.

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -fill-empty y -fill-type 3 -fill-value 30

The resulting dataset is:

x y
1 10
1 10
  40
4 30

Fill Multiple Columns

The fill type that is specified will be applied to all columns specified in fill empty. Currently, there is no option to use different fill types for separate columns.

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -fill-empty x,y -fill-type 3 -fill-value 30

In the resulting dataset, every empty value in the specified columns has been populated with 30.

x y
1 10
1 10
30 40
4 30
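
The difference between filling one column and filling several can be sketched in plain Python. This is not Lucy's implementation: the function `fill_with_value` is a hypothetical stand-in for ‘-fill-type 3’ combined with ‘-fill-value’, and empty cells are modeled as `None`.

```python
def fill_with_value(rows, columns, value):
    """Replace None in the listed columns with a user-defined value."""
    return [
        {k: (value if k in columns and v is None else v) for k, v in row.items()}
        for row in rows
    ]

rows = [
    {"x": 1, "y": 10},
    {"x": 1, "y": 10},
    {"x": None, "y": 40},
    {"x": 4, "y": None},
]

# Only y is listed, so the empty x value is left untouched.
only_y = fill_with_value(rows, ["y"], 30)

# Both columns are listed, so every empty value receives the same fill value.
both = fill_with_value(rows, ["x", "y"], 30)

print([(r["x"], r["y"]) for r in only_y])
print([(r["x"], r["y"]) for r in both])
```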

Bin Column

WIP

Other Preprocessing

Other data preprocessing options not covered above.

Normalize

The general idea of a normalizer is to convert a list of values so that they are unitless. The options are:

  • standard
  • minmax

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -normalize standard
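
For reference, the two method names usually refer to the scalings sketched below. This assumes the common textbook definitions: ‘standard’ subtracts the mean and divides by the standard deviation, and ‘minmax’ maps the values onto the range 0 to 1. Whether Lucy uses the population or the sample standard deviation is not stated here, so the use of `pstdev` is an assumption, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def standard_scale(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # zero for a constant column; not handled here
    return [(v - mu) / sigma for v in values]

def minmax_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)  # hi == lo is not handled here
    return [(v - lo) / (hi - lo) for v in values]

data = [-2, -1, 0, 1, 2]
standardized = standard_scale(data)
rescaled = minmax_scale(data)

print(rescaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Both results are unitless: the standardized values are expressed in standard deviations from the mean, and the min-max values as a fraction of the column's range.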

Autoclean