4.2. Explore¶
The ML Explore Plugin provides data insights from individual features and how features relate to a target.
python bin/lucy.egg plugins ml_explore -h
Explore Arguments¶
Primary¶
Main specifications for an Explore task.
Argument | Detail |
---|---|
-input | PATH. File path to the input data. |
-output | PATH. File path where the output summary file will be saved. It is recommended that this file be saved as JSON. However, if ‘-get-schema’ is the only Explore task used, the output can be saved as a CSV. |
-clabel | STRING. The name of the Target/Response/output column. |
Preprocessing¶
Options for manipulating the dataset before using it.
Argument | Detail |
---|---|
-independents | COMMA SEPARATED LIST. A subset of column names that will be used for Exploring. |
-normalize | METHOD. (‘standard’, ‘minmax’) Normalizes the columns of ALL numerical data using the selected scaling method. |
-autoclean | NONE. Cleans the data based on a predefined set of rules. |
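The two -normalize methods can be illustrated with a minimal Python sketch. It assumes the conventional definitions of standard scaling (subtract the mean, divide by the standard deviation) and min-max scaling (map the column onto [0, 1]). Whether Lucy uses the population or sample standard deviation is not stated here, so the population form below is an assumption, and the function names are hypothetical.

```python
import statistics


def standard_scale(values):
    """Rescale to zero mean and unit variance (population std is an assumption)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def minmax_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


column = [2.0, 4.0, 6.0, 8.0]  # illustrative values only
standardized = standard_scale(column)
rescaled = minmax_scale(column)
```

Both functions preserve the ordering of the values; they differ only in the location and scale of the result.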
Data Insight Tasks¶
The main tasks for Explore.
Argument | Detail |
---|---|
-get-schema | NONE. Produces information about each column: count, null count, unique count, data type. Depending on the data type, additional information is provided: range, mean, most frequent value, bins, etc. |
-feature-importance | NONE. Provides a score for each input feature describing its relation to the column chosen with ‘-clabel’. There are various measures for this; scores of the same type can be compared, but scores of different types should not be compared to one another. |
Explore Examples¶
Examples to demonstrate how to use Explore arguments when performing an Explore task with a Lucy Build.
Primary¶
These arguments are the main drivers of an Explore task. Without them, some or all Explore tasks will not be completed in a meaningful way.
Input/Output¶
An Explore routine begins by choosing the file path from which the input data is read. This is done using the -input argument.
An Explore routine ends by saving its output. The output file holds the results of the tasks that have been completed, and its path is set using the -output argument.
C-Label¶
This is used when a user wishes to specify a target column. This is only important when using the Feature Importance task.
Preprocessing¶
There are a few options for manipulating the dataset before using it in an Explore task.
Independents¶
A subset of columns can be used instead of the entire dataset. In the event that a C-Label is specified, the C-Label column is kept along with the columns selected with independents.
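The behaviour described above can be sketched in Python as follows. This is only an illustration of the stated rule (selected columns plus the C-Label column are kept), not Lucy's implementation; the helper name select_columns and the sample rows are hypothetical.

```python
import csv
import io


def select_columns(rows, independents, clabel=None):
    """Keep only the chosen columns, plus the label column when one is given."""
    keep = list(independents)
    if clabel is not None and clabel not in keep:
        keep.append(clabel)
    return [{name: row[name] for name in keep} for row in rows]


# Illustrative sample data in the shape of the iris dataset.
sample = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,class\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)
rows = list(csv.DictReader(sample))
subset = select_columns(rows, ["petal_length", "petal_width"], clabel="class")
```

Here subset retains petal_length, petal_width and class, while the two sepal columns are dropped.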
Autoclean¶
Autoclean is a way to clean a dataset before performing the Explore task. Details of the exact procedures performed by Autoclean can be found at Clean’s Autoclean.
The argument -autoclean is used to clean the dataset before the Explore task is applied.
python bin/lucy.egg plugins ml_explore -input ./input_data.csv -output ./out.json -get-schema -autoclean
Data Insight Tasks¶
Tasks can be performed individually or together.
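As a sketch of running both tasks in one invocation, the following Python helper assembles the command line from the arguments documented in this section. The helper build_explore_command is hypothetical, and it assumes the -independents list is written as comma-separated names without spaces.

```python
def build_explore_command(input_path, output_path, tasks,
                          clabel=None, independents=None):
    """Assemble an ml_explore command line as a list of arguments."""
    cmd = ["python", "bin/lucy.egg", "plugins", "ml_explore",
           "-input", input_path, "-output", output_path]
    if clabel:
        cmd += ["-clabel", clabel]
    if independents:
        # Assumption: names are joined with commas and no spaces.
        cmd += ["-independents", ",".join(independents)]
    cmd += list(tasks)
    return cmd


cmd = build_explore_command(
    "iris.csv",
    "iris_output.json",
    tasks=["-get-schema", "-feature-importance"],
    clabel="class",
)
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run(cmd, check=True) to launch the task from a script.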
Get Schema¶
Produces information about each column.
All columns have the following information returned:
- id - name of the column
- dataType - the type of data found in the column (text or number)
- count - the length of the dataset
- null_count - number of empty values found in the column
- unique_count - the number of unique values found in a column
If dataType is “text”, the following additional information is provided:
- most_frequent - the top 5 most common values in the column along with the number of times that value appeared in the column
If dataType is “number”, the following additional information is provided:
- std - the sample standard deviation of the column
- mean - the average value of the column
- min - the minimum value of the column
- 25% - the lower quartile of the column
- 50% - the median of the column
- 75% - the upper quartile of the column
- max - the maximum value of the column
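To make the fields listed above concrete, here is a minimal Python sketch that computes a comparable summary for one column. It is not Lucy's implementation. Several details are assumptions: empty strings are treated as nulls, nulls are excluded from unique_count, a column is "number" when every non-null value parses as a float, and quartiles use linear interpolation. The function name summarize_column is hypothetical.

```python
import statistics
from collections import Counter


def summarize_column(name, values):
    """Build a schema-like summary for a single column of raw values."""
    present = [v for v in values if v not in (None, "")]
    summary = {
        "id": name,
        "count": len(values),
        "null_count": len(values) - len(present),
        "unique_count": len(set(present)),
    }
    try:
        numbers = [float(v) for v in present]
    except (TypeError, ValueError):
        numbers = None
    # The sample standard deviation and quartiles need at least two points.
    if numbers is not None and len(numbers) >= 2:
        q1, median, q3 = statistics.quantiles(numbers, n=4, method="inclusive")
        summary.update({
            "dataType": "number",
            "std": statistics.stdev(numbers),  # sample standard deviation
            "mean": statistics.fmean(numbers),
            "min": min(numbers),
            "25%": q1,
            "50%": median,
            "75%": q3,
            "max": max(numbers),
        })
    else:
        summary.update({
            "dataType": "text",
            # Top 5 values with the number of times each appears.
            "most_frequent": Counter(present).most_common(5),
        })
    return summary


widths = summarize_column("petal_width", ["0.2", "0.4", "", "1.4", "1.8"])
labels = summarize_column("class", ["a", "b", "a"])
```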
Feature Importance¶
This task measures how each column of the dataset relates to a specific column. The column of interest is specified using C-Label, so C-Label is required when using this task.
Lucy uses two metrics for measuring Feature Importance.
- Correlation - The correlation between each numeric input variable and a numeric target.
- Gini Importance - The importance measure of each input variable for a given target.
To obtain these, the option -feature-importance is used; it takes no argument. The scores are computed for the column specified with -clabel and are saved in the output file at the path given by -output.
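As a rough illustration of the correlation metric, the sketch below computes the Pearson correlation coefficient between one numeric feature and a numeric target. Which correlation coefficient Lucy reports is not stated here, so Pearson is an assumption, and the function name is hypothetical. Gini importance, which in common usage is derived from tree-based models, is not sketched.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when either sequence is constant.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


feature = [1.0, 2.0, 3.0, 4.0]   # illustrative values only
target = [2.0, 4.0, 6.0, 8.0]
score = pearson_correlation(feature, target)
inverse = pearson_correlation(feature, list(reversed(target)))
```

A score near 1 or -1 indicates a strong linear relationship with the target, and a score near 0 indicates little linear relationship.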
Consider the iris dataset and the following command:
python bin/lucy.egg plugins ml_explore -input iris.csv -output iris_output.json -clabel class -feature-importance
The Lucy-Response is populated with the following scores:
{
  ...
  "payload": {
    ...
    "feature_importance": {
      "gini_importance": {
        "sepal_width": 0.006145984326963746,
        "petal_width": 0.4852350158451978,
        "sepal_length": 0.0059189255008685404,
        "petal_length": 0.50270007432697
      }
    }
  }
}
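A short Python sketch shows one way to read these scores and rank the features. It embeds only the fields shown above, with the elided parts of the response left out; in practice the text would be read from the file given to -output. The key path payload, feature_importance, gini_importance is taken from the example response.

```python
import json

# Only the fields shown in the example response; elided parts are omitted.
response_text = """
{
  "payload": {
    "feature_importance": {
      "gini_importance": {
        "sepal_width": 0.006145984326963746,
        "petal_width": 0.4852350158451978,
        "sepal_length": 0.0059189255008685404,
        "petal_length": 0.50270007432697
      }
    }
  }
}
"""

response = json.loads(response_text)
scores = response["payload"]["feature_importance"]["gini_importance"]

# Sort features from most to least important.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.4f}")
```

In this example the two petal measurements account for nearly all of the importance, while the sepal measurements contribute very little.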