4.1. Clean¶
The ML Clean Plugin has various options for cleaning a dataset.
python bin/lucy.egg plugins ml_clean -h
Primary Arguments¶
Main specifications for a Clean task.
Argument | Detail |
---|---|
-input | PATH. File path to input file. |
-output | PATH. File path to where the cleaned data will be saved. It is recommended that this file be saved as a CSV, since that is the format currently supported. |
-order | COMMA SEPARATED LIST. The order in which the tasks will be carried out. If an order is given, only tasks listed in the order will be used when cleaning the data. The default order of tasks is: keep, drop_column, bin_column, filter, fill_empty, fill_type, drop_empty, normalize. |
Choose Columns¶
Define a subset of columns to use.
Argument | Detail |
---|---|
-keep | COMMA SEPARATED LIST. The names of the columns as they appear in the input file that will be kept. The rest will be deleted. This is similar to ‘-drop-column’. It is recommended that either ‘-drop-column’ or ‘-keep’ be used, but never both. |
-drop-column | COMMA SEPARATED LIST. The names of the columns as they appear in the input file that will be deleted. The rest will be kept. This is similar to ‘-keep’. It is recommended that either ‘-drop-column’ or ‘-keep’ be used, but never both. |
Empty Values¶
Methods for dealing with empty values in the dataset.
Argument | Detail |
---|---|
-drop-empty | NONE. Deletes all rows from the dataset that contain empty values. |
-fill-empty | COMMA SEPARATED LIST. The names of the columns as they appear in the input file that will have their empty values populated. If no ‘-fill-type’ is specified, the empty values will be populated with their column’s mode. |
-fill-type | INT. ( ‘0’:mean, ‘1’:median, ‘2’:mode, ‘3’:user_specified) Define what will be populated in the empty cells. This must be used with ‘-fill-empty’. |
-fill-value | VALUE. (String, Numerical) A user-defined value that will populate the empty cells. This must be used with ‘-fill-empty’ and ‘-fill-type 3’. |
Bin Column¶
Bins Column Values.
Argument | Detail |
---|---|
-bin-column | STRING. The name of the column to be binned, as it appears in the input data. |
-nbins | INT. The number of bins to split the column into. The default value is 4. |
-bin-values | COMMA SEPARATED LIST. A list of numerical values that defines how the bins will be split. The number of values provided must be one greater than the value of ‘-nbins’. |
-bin-replace | NONE. The use of this command results in the replacement of the original column that has been binned using ‘-bin-column’. |
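The requirement that ‘-bin-values’ holds one more value than ‘-nbins’ follows from the fact that n bins are delimited by n + 1 edges. The sketch below illustrates this in plain Python. It is not the plugin's implementation: the function name `assign_bins`, the choice of which edge is inclusive, and the clamping of out-of-range values are all assumptions made for illustration.

```python
import bisect

def assign_bins(values, edges):
    """Map each value to a bin index using explicit, sorted bin edges.

    Hypothetical helper: n + 1 edges define n bins, numbered 0 .. n - 1.
    """
    n_bins = len(edges) - 1
    result = []
    for v in values:
        # bisect_right counts how many edges are <= v.
        idx = bisect.bisect_right(edges, v) - 1
        # Assumption: values on or beyond the outer edges go to the nearest bin.
        result.append(min(max(idx, 0), n_bins - 1))
    return result

edges = [0, 25, 50, 75, 100]          # 5 edges -> 4 bins
bins = assign_bins([3, 25, 60, 100], edges)
print(len(edges) - 1, bins)
```

How the plugin itself treats values that fall exactly on an edge or outside the outer edges is not stated here, so that part of the sketch should be checked against actual output.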
Other Preprocessing¶
Other preprocessing options to apply to the entire dataset.
Argument | Detail |
---|---|
-normalize | METHOD. (‘standard’, ‘minmax’) Normalizes the columns of ALL numerical data using the selected scaling method. |
Clean Examples¶
Input¶
A Clean routine begins by choosing the file path of where the input data is to be read from.
python bin/lucy.egg plugins ml_clean -input ./input_data.csv
The dataset that has been selected can now be cleaned by any number of cleaning methods.
Output¶
Once a dataset has been cleaned into its desired form, it is ready to be saved. This is done by specifying the file path where the data is to be saved.
python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv
Clean is done and the resulting dataset is saved to the output path.
Task Order¶
A user is able to specify the order that the cleaning tasks are performed. If an order is given, only tasks listed in the order will be used when cleaning the data.
Consider the input data:
x | y |
---|---|
-2 | 4 |
-1 | 1 |
0 | |
1 | 1 |
2 | 4 |
python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -drop-column y -drop-empty
Yields the result:
x |
---|
-2 |
-1 |
0 |
1 |
2 |
This is because the default order is keep, drop_column, bin_column, filter, fill_empty, fill_type, drop_empty, autoclean, normalize, schema. Of the two tasks used here, drop_column runs before drop_empty, so the ‘y’ column is dropped before any rows containing empty values are looked for, and no rows are removed. Now consider changing the order with the -order option:
python bin/lucy.egg plugins ml_clean -input ./input_data.csv -drop-column y -drop-empty -order drop_empty,drop_column
The resulting Cleaned data is:
x |
---|
-2 |
-1 |
1 |
2 |
This is because the row containing the empty y value is removed before the entire y column is removed.
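The effect of task order can be reproduced with a small plain-Python model of the two tasks. This is only an illustration of the reasoning above, not the plugin's code: the helper functions `drop_column` and `drop_empty` are hypothetical stand-ins, and empty cells are represented as `None`.

```python
# Illustrative model of two cleaning tasks (not the plugin's implementation).
rows = [
    {"x": -2, "y": 4},
    {"x": -1, "y": 1},
    {"x": 0, "y": None},  # the empty y value
    {"x": 1, "y": 1},
    {"x": 2, "y": 4},
]

def drop_column(data, name):
    """Return the rows without the named column."""
    return [{k: v for k, v in row.items() if k != name} for row in data]

def drop_empty(data):
    """Return only the rows in which every value is present."""
    return [row for row in data if all(v is not None for v in row.values())]

# Default order: drop_column runs first, so no empty value is left to trigger a row drop.
default_order = [r["x"] for r in drop_empty(drop_column(rows, "y"))]

# Custom order: drop_empty runs first, so the row with the empty y is removed.
custom_order = [r["x"] for r in drop_column(drop_empty(rows), "y")]

print(default_order)  # [-2, -1, 0, 1, 2]
print(custom_order)   # [-2, -1, 1, 2]
```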
Choose Columns¶
There are two ways to select a subset of columns from the original data. Either the desired column(s) can be specified through Keep Columns or the undesired column(s) can be omitted through Drop Columns.
Consider the dataset with the following feature names:
Age | Weight | Gender | Blood_Pressure |
---|---|---|---|
Warning
It is recommended that the column names do not contain spaces. This is especially important when specifying column names to the Lucy build through a command prompt.
Keep Columns¶
Keep Columns is the scenario where a subset of columns is to be defined using the names of the columns that are to be kept.
python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -keep Weight,Blood_Pressure
Note
There is no space after the comma in the list of column names.
The resulting dataset will be:
Weight | Blood_Pressure |
---|---|
Drop Columns¶
Drop Columns allows for the reverse action of Keep Columns. Columns that are specified will be removed from the dataset.
python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -drop-column Weight,Blood_Pressure
The resulting dataset will be:
Age | Gender |
---|---|
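The following sketch shows why Keep Columns and Drop Columns are complementary, and why the comma-separated list should not contain spaces. It is a plain-Python illustration under stated assumptions: the helpers `parse_list`, `keep_columns` and `drop_columns` are hypothetical, the plugin is assumed to split the list on commas, and `shlex.split` is used only to mimic how a POSIX-style shell splits a command line into arguments.

```python
import shlex

columns = ["Age", "Weight", "Gender", "Blood_Pressure"]

def parse_list(arg):
    """Split a comma-separated argument into names (assumed parsing rule)."""
    return arg.split(",")

def keep_columns(cols, names):
    """Keep only the named columns, preserving the original column order."""
    wanted = set(names)
    return [c for c in cols if c in wanted]

def drop_columns(cols, names):
    """Remove the named columns, preserving the original column order."""
    unwanted = set(names)
    return [c for c in cols if c not in unwanted]

kept = keep_columns(columns, parse_list("Weight,Blood_Pressure"))
remaining = drop_columns(columns, parse_list("Weight,Blood_Pressure"))
print(kept)       # ['Weight', 'Blood_Pressure']
print(remaining)  # ['Age', 'Gender']

# With a space after the comma, the shell passes two separate arguments,
# so the option no longer receives the full list.
with_space = shlex.split("-keep Weight, Blood_Pressure")
print(with_space)  # ['-keep', 'Weight,', 'Blood_Pressure']
```

Because the two operations partition the same set of columns, specifying both for one run is redundant at best, which is consistent with the recommendation to use only one of them.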
Empty Values¶
There are several options for populating values in the dataframe that are null/NaN. Empty values can be filled with the mean, median or mode of the column. If desired, a custom value can be used instead.
Consider the following dataset:
x | y |
---|---|
1 | 10 |
1 | 10 |
 | 40 |
4 | |
Drop Empty¶
Drop Empty results in the deletion of all rows that have an empty value.
python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -drop-empty
The resulting dataset is
x | y |
---|---|
1 | 10 |
1 | 10 |
Fill Empty¶
Fill Empty specifies which column(s) will have their empty values populated. The mode is used by default.
python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -fill-empty x
The resulting dataset is
x | y |
---|---|
1 | 10 |
1 | 10 |
1 | 40 |
4 | |
Fill Type¶
Empty values can be populated using a few options:
- 0: Mean
- 1: Median
- 2: Mode
- 3: User Specified
Note
If fill type 3 is chosen, the value has to be specified with a fill value. See Fill Value for an example.
python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -fill-empty y -fill-type 0
The resulting dataset is
x | y |
---|---|
1 | 10 |
1 | 10 |
 | 40 |
4 | 20 |
Note
When the column contains categorical data and a fill type of ‘0’ or ‘1’ is used, the empty values will be filled with the mode instead.
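The fill types can be modelled in a few lines of plain Python using the standard `statistics` module. This is a sketch of the behaviour described above, not the plugin's implementation: the function `fill_empty` and its parameters are hypothetical, empty cells are represented as `None`, and the fallback to the mode for non-numeric columns follows the note above.

```python
import statistics

def fill_empty(values, fill_type=2, fill_value=None):
    """Fill None entries in one column (hypothetical helper).

    fill_type: 0 = mean, 1 = median, 2 = mode (default), 3 = user specified.
    """
    present = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in present)
    if fill_type == 3:
        replacement = fill_value
    elif fill_type == 0 and numeric:
        replacement = statistics.mean(present)
    elif fill_type == 1 and numeric:
        replacement = statistics.median(present)
    else:
        # Mode is the default, and the fallback for categorical columns.
        replacement = statistics.mode(present)
    return [replacement if v is None else v for v in values]

x = [1, 1, None, 4]
y = [10, 10, 40, None]

filled_x_mode = fill_empty(x)                               # mode of 1, 1, 4 is 1
filled_y_mean = fill_empty(y, fill_type=0)                  # mean of 10, 10, 40 is 20
filled_y_user = fill_empty(y, fill_type=3, fill_value=30)   # user-specified value
print(filled_x_mode, filled_y_mean, filled_y_user)
```

The mean case reproduces the value 20 shown in the table above, since (10 + 10 + 40) / 3 = 20.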
Fill Value¶
Empty values can be populated using a user-defined value. To do this, ‘-fill-empty’ must name the column(s) to fill and a fill type of ‘3’ must be selected.
python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -fill-empty y -fill-type 3 -fill-value 30
The resulting dataset is
x | y |
---|---|
1 | 10 |
1 | 10 |
 | 40 |
4 | 30 |
Fill Multiple Columns¶
The fill type that is specified will be applied to all columns specified in fill empty. Currently, there is not an option to directly handle different fill types for separate columns.
python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -fill-empty x,y -fill-type 3 -fill-value 30
In the resulting dataset, every empty value in the specified columns has been populated with 30.
x | y |
---|---|
1 | 10 |
1 | 10 |
30 | 40 |
4 | 30 |
Bin Column¶
WIP
Other Preprocessing¶
Other data preprocessing options that were not mentioned.
Normalize¶
The general idea of a normalizer is to convert a list of values into unitless values on a common scale. The options are:
- standard
- minmax
python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -normalize standard
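As a rough illustration of what the two methods usually mean, the sketch below applies common definitions of each to a single numeric column. These definitions are assumptions, not confirmed plugin behaviour: ‘standard’ is taken to be z-score scaling (subtract the mean, divide by the standard deviation, here the population standard deviation) and ‘minmax’ is taken to rescale values to the range 0 to 1. The function names are hypothetical.

```python
import statistics

def standard_scale(values):
    """Z-score scaling: result has mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population, not sample, std dev
    return [(v - mean) / std for v in values]

def minmax_scale(values):
    """Min-max scaling: smallest value maps to 0, largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

x = [-2, -1, 0, 1, 2]
standardized = standard_scale(x)
rescaled = minmax_scale(x)
print(rescaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Both helpers divide by zero for a column whose values are all identical, so a real implementation needs to handle constant columns separately.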