4.1. Clean¶

The ML Clean Plugin has various options for cleaning a dataset.

python bin/lucy.egg plugins ml_clean -h

Primary Arguments¶

Main specifications for a Clean task.

Argument	Detail
-input	PATH. File path to input file.
-output	PATH. File path to where the cleaned data will be saved. It is recommended that this file be saved as a CSV, since that is the only format currently supported.
-order	COMMA SEPARATED LIST. The order that the tasks will be carried out. If an order is given, only tasks listed in the order will be used when cleaning the data. The default order of tasks is: keep, -drop-column, -bin-column, -filter, -fill-empty, -fill-type, -drop-empty, normalize.

Choose Columns¶

Define a subset of columns to use.

Argument	Detail
-keep	COMMA SEPARATED LIST. The names of the columns as they appear in the input file that will be kept. The rest will be deleted. This is similar to ‘-drop-column’. It is recommended that either ‘-drop-column’ or ‘-keep’ be used, but never both.
-drop-column	COMMA SEPARATED LIST. The names of the columns as they appear in the input file that will be deleted. The rest will be kept. This is similar to ‘-keep’. It is recommended that either ‘-drop-column’ or ‘-keep’ be used, but never both.

Empty Values¶

Methods for dealing with empty values in the dataset.

Argument	Detail
-drop-empty	NONE. Deletes all rows from the dataset that contain empty values.
-fill-empty	COMMA SEPARATED LIST. The names of the columns as they appear in the input file that will have their empty values populated. If no ‘-fill-type’ is specified, the empty values will be populated with their column’s mode.
-fill-type	INT. ( ‘0’:mean, ‘1’:median, ‘2’:mode, ‘3’:user_specified) Define what will be populated in the empty cells. This must be used with ‘-fill-empty’.
-fill-value	VALUE. (String, Numerical) A user-defined value that will populate the empty cells. This must be used with ‘-fill-empty’ and ‘-fill-type 3’.

Bin Column¶

Bins Column Values.

Argument	Detail
-bin-column	STRING. The name of the column to bin, as it appears in the input data.
-nbins	INT. The number of bins to split the column into. The default value is 4.
-bin-values	COMMA SEPARATED LIST. A list of numerical values that defines how the bins will be split. The number of values provided must be one greater than ‘-nbins’.
-bin-replace	NONE. When this option is given, the binned values replace the original column named in ‘-bin-column’.
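
The relationship between ‘-nbins’ and ‘-bin-values’ (one more edge than there are bins) can be illustrated with a short sketch. This is plain Python written for this page, not the plugin's code. The equal-width default, the handling of the upper edge, the helper names, and the sample ages are all assumptions made for the illustration.

```python
from bisect import bisect_right


def equal_width_edges(values, nbins=4):
    """Return nbins + 1 evenly spaced edges spanning the data (an assumed default)."""
    low, high = min(values), max(values)
    width = (high - low) / nbins
    return [low + i * width for i in range(nbins + 1)]


def assign_bin(value, edges):
    """Return the 0-based bin index; the last bin also includes the upper edge."""
    index = bisect_right(edges, value) - 1
    return min(index, len(edges) - 2)


# Made-up sample values, used only to exercise the helpers.
ages = [18, 25, 33, 47, 58]
edges = equal_width_edges(ages, nbins=4)

print(edges)                                  # five edges define four bins
print([assign_bin(a, edges) for a in ages])   # bin index for each value
```

A user-supplied edge list, as passed to ‘-bin-values’, would take the place of `edges` in this sketch.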

Other Preprocessing¶

Other preprocessing options to apply to the entire dataset.

Argument	Detail
-normalize	METHOD. (‘standard’, ‘minmax’) Normalizes the columns of ALL numerical data using the selected scaling method.

Clean Examples¶

Input¶

A Clean routine begins by choosing the file path that the input data will be read from.

python bin/lucy.egg plugins ml_clean -input ./input_data.csv

The selected dataset can now be cleaned using any number of cleaning methods.

Output¶

Once a dataset has been cleaned into its desired form, it is ready to be saved. This is done by selecting the file path to where the data is to be saved.

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv

The Clean task runs and the resulting dataset is saved to the output path.

Task Order¶

A user can specify the order in which the cleaning tasks are performed. If an order is given, only tasks listed in the order will be used when cleaning the data.

Consider the input data:

x	y
-2	4
-1	1
0
1	1
2	4

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -drop-column y -drop-empty

Yields the result:

x
-2
-1
0
1
2

This result follows from the default order: keep, drop_column, bin_column, filter, fill_empty, fill_type, drop_empty, autoclean, normalize, schema. Of the two tasks used here, drop_column runs before drop_empty, so the ‘y’ column is dropped before any rows containing empty values are dropped, and no empty values remain when drop_empty runs.

Now consider the case where the order is changed using the -order option:

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -drop-column y -drop-empty -order drop_empty,drop_column

The resulting cleaned data is:

x
-2
-1
1
2

This is because the row containing the empty y value is removed before the entire y column is removed.
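
The effect of task order can be reproduced outside the plugin. The sketch below is a plain-Python illustration, not Lucy's implementation. The helper functions `drop_column` and `drop_empty` and the use of `None` for an empty cell are assumptions made for the example.

```python
# Rows of the example input; None marks the empty y cell (an assumption of this sketch).
rows = [
    {"x": -2, "y": 4},
    {"x": -1, "y": 1},
    {"x": 0, "y": None},
    {"x": 1, "y": 1},
    {"x": 2, "y": 4},
]


def drop_column(data, name):
    """Return a copy of the rows without the named column."""
    return [{k: v for k, v in row.items() if k != name} for row in data]


def drop_empty(data):
    """Return only the rows in which no cell is empty."""
    return [row for row in data if all(v is not None for v in row.values())]


# Default order: the column goes first, so no empty cell is left to trigger a row drop.
default_order = drop_empty(drop_column(rows, "y"))

# Custom order: the row with the empty y goes first, then the y column.
custom_order = drop_column(drop_empty(rows), "y")

print([row["x"] for row in default_order])  # [-2, -1, 0, 1, 2]
print([row["x"] for row in custom_order])   # [-2, -1, 1, 2]
```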

Choose Columns¶

There are two ways to select a subset of columns from the original data. Either the desired column(s) can be specified through Keep Columns or the undesired column(s) can be omitted through Drop Columns.

Consider the dataset with the following feature names:

Age	Weight	Gender	Blood_Pressure

Warning

It is recommended that the column names do not contain spaces. This is especially important when trying to specify column names using the Lucy build through a command prompt.

Keep Columns¶

Keep Columns is the scenario where a subset of columns is to be defined using the names of the columns that are to be kept.

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -keep Weight,Blood_Pressure

Note

There is no space after the comma in the list.

The resulting dataset will be:

Weight	Blood_Pressure
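
Why spaces cause trouble can be seen by looking at how a command line is split into arguments. The sketch below uses Python's standard `shlex` module, which follows POSIX shell rules. Other command prompts may tokenize differently, and how the plugin reacts to the extra arguments is not shown here.

```python
import shlex

# No space after the comma: the whole list arrives as one argument.
good = shlex.split("-keep Weight,Blood_Pressure")

# A space after the comma splits the list into two separate arguments.
spaced = shlex.split("-keep Weight, Blood_Pressure")

# A space inside a column name has the same effect.
named = shlex.split("-keep Blood Pressure,Weight")

print(good)    # ['-keep', 'Weight,Blood_Pressure']
print(spaced)  # ['-keep', 'Weight,', 'Blood_Pressure']
print(named)   # ['-keep', 'Blood', 'Pressure,Weight']
```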

Drop Columns¶

Drop Columns allows for the reverse action of Keep Columns. Columns that are specified will be removed from the dataset.

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -drop-column Weight,Blood_Pressure

The resulting dataset will be:

Age	Gender
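
Keep Columns and Drop Columns are complements of each other. A minimal sketch of that relationship, using hypothetical helper functions `keep` and `drop` that operate on the list of column names only:

```python
columns = ["Age", "Weight", "Gender", "Blood_Pressure"]


def keep(all_columns, names):
    """Keep only the listed columns, preserving the original column order."""
    return [c for c in all_columns if c in names]


def drop(all_columns, names):
    """Remove the listed columns, preserving the original column order."""
    return [c for c in all_columns if c not in names]


selected = ["Weight", "Blood_Pressure"]

kept = keep(columns, selected)
dropped = drop(columns, selected)

print(kept)     # ['Weight', 'Blood_Pressure']
print(dropped)  # ['Age', 'Gender']
```

Keeping a set of columns gives the same result as dropping every other column, which is why only one of the two options is needed in a single command.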

Empty Values¶

There are several options for populating values in the dataframe that are null/NaN. The fill value can be the mean, median, or mode of the column. If desired, a custom value can be used instead.

Consider the following dataset:

x	y
1	10
1	10
	40
4

Drop Empty¶

Drop Empty results in the deletion of all rows that have an empty value.

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -drop-empty

The resulting dataset is:

x	y
1	10
1	10

Fill Empty¶

Fill Empty specifies which column(s) will have their empty values populated. The mode is used by default.

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -fill-empty x

The resulting dataset is:

x	y
1	10
1	10
1	40
4
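
The default behaviour can be sketched in a few lines of standard-library Python. This is an illustration of filling with the mode, not the plugin's code. The function `fill_empty_with_mode` is hypothetical, and `None` stands in for an empty cell.

```python
from statistics import mode

# Example dataset; None marks an empty cell (an assumption of this sketch).
rows = [
    {"x": 1, "y": 10},
    {"x": 1, "y": 10},
    {"x": None, "y": 40},
    {"x": 4, "y": None},
]


def fill_empty_with_mode(data, column):
    """Fill empty cells in one column with that column's most common value."""
    present = [row[column] for row in data if row[column] is not None]
    fill = mode(present)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in data
    ]


filled = fill_empty_with_mode(rows, "x")

# Only x is filled; the empty y cell in the last row is left untouched.
print([(row["x"], row["y"]) for row in filled])
```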

Fill Type¶

Empty values can be populated using a few options:

0: Mean
1: Median
2: Mode
3: User Specified

Note

If fill type 3 is chosen, the value has to be specified with a fill value. See Fill Value for an example.

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -fill-empty y -fill-type 0

The resulting dataset is:

x	y
1	10
1	10
	40
4	20

Note

When the column contains categorical data and a fill type of ‘0’ or ‘1’ is used, the empty values are filled with the mode instead.
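
The numeric fill types and the categorical fallback can be sketched as follows. This is a standard-library illustration under stated assumptions: the function `fill_value_for` and the `FILL_FUNCTIONS` table are hypothetical, `None` marks an empty cell, and a column counts as categorical when any non-empty value is not a number.

```python
from statistics import mean, median, mode

# Fill type codes as listed above; type 3 (user specified) is handled separately.
FILL_FUNCTIONS = {0: mean, 1: median, 2: mode}


def fill_value_for(values, fill_type):
    """Compute the fill value for a column from its non-empty cells."""
    present = [v for v in values if v is not None]
    is_numeric = all(isinstance(v, (int, float)) for v in present)
    # Mean and median are undefined for categorical data, so fall back to the mode.
    if not is_numeric and fill_type in (0, 1):
        fill_type = 2
    return FILL_FUNCTIONS[fill_type](present)


y = [10, 10, 40, None]

print(fill_value_for(y, 0))  # mean of 10, 10, 40
print(fill_value_for(y, 1))  # median of 10, 10, 40
print(fill_value_for(y, 2))  # mode of 10, 10, 40
print(fill_value_for(["a", "b", "b", None], 0))  # categorical: mode is used
```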

Fill Value¶

Empty values in the selected columns can be populated using a user-defined value. To do this, ‘-fill-empty’ must name the column(s) to fill and a fill type of ‘3’ must be selected.

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -fill-empty y -fill-type 3 -fill-value 30

The resulting dataset is:

x	y
1	10
1	10
	40
4	30

Fill Multiple Columns¶

The fill type that is specified will be applied to all columns specified in fill empty. Currently, there is not an option to directly handle different fill types for separate columns.

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -fill-empty x,y -fill-type 3 -fill-value 30

In the resulting dataset, every empty value in the specified columns is populated with 30:

x	y
1	10
1	10
30	40
4	30
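
A minimal sketch of applying one user-defined value to several columns, written in plain Python for illustration only. The function `fill_columns` is hypothetical and `None` stands in for an empty cell.

```python
# Example dataset; None marks an empty cell (an assumption of this sketch).
rows = [
    {"x": 1, "y": 10},
    {"x": 1, "y": 10},
    {"x": None, "y": 40},
    {"x": 4, "y": None},
]


def fill_columns(data, columns, value):
    """Fill empty cells in the listed columns with a single user-defined value."""
    return [
        {k: (value if k in columns and v is None else v) for k, v in row.items()}
        for row in data
    ]


both = fill_columns(rows, ["x", "y"], 30)
only_y = fill_columns(rows, ["y"], 30)

print([(row["x"], row["y"]) for row in both])
print([(row["x"], row["y"]) for row in only_y])
```

Listing only `y` leaves the empty `x` cell untouched, which matches the Fill Value example above.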

Bin Column¶

WIP

Other Preprocessing¶

Data preprocessing options not covered in the sections above.

Normalize¶

The general idea of a normalizer is to rescale a list of values so that they are unitless. The options are:

standard

minmax

python bin/lucy.egg plugins ml_clean -input ./input_data.csv -output ./output_data.csv -normalize standard
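
This page does not define the two methods. The sketch below shows the common textbook meaning of each name, on the assumption that the plugin follows it: ‘standard’ as z-score scaling and ‘minmax’ as rescaling to the range 0 to 1. The function names are hypothetical, and the use of the population standard deviation is also an assumption.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Z-score scaling: subtract the mean, divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def minmax_scale(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


x = [-2, -1, 0, 1, 2]

standardized = standard_scale(x)
rescaled = minmax_scale(x)

print(standardized)  # values centred on 0 with unit standard deviation
print(rescaled)      # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Both helpers divide by zero on a column whose values are all equal. A real implementation would need to handle that case.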
