4.2. Explore

The ML Explore Plugin provides insights into individual features and into how those features relate to a target.

python bin/lucy.egg plugins ml_explore -h

Explore Arguments

Primary

Main specifications for an Explore task.

Argument	Detail
-input	PATH. File path to the input data.
-output	PATH. File path where the output summary file will be saved. It is recommended that this file be saved as JSON. However, if ‘-get-schema’ is the only Explore task used, the output can be saved as a CSV.
-clabel	STRING. The name of the Target/Response/output column.

Preprocessing

Options for manipulating the dataset before using it.

Argument	Detail
-independents	COMMA SEPARATED LIST. A subset of column names that will be used for Exploring.
-normalize	METHOD. (‘standard’, ‘minmax’) Normalizes the columns of ALL numerical data using the selected scaling method.
-autoclean	NONE. Cleans the data according to a predefined set of rules.
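As a rough illustration of what the two -normalize methods conventionally mean, the sketch below rescales a single numeric column with standard (z-score) and min-max scaling, using only the Python standard library. It assumes Lucy follows these common definitions, which this section does not confirm. The function names and the use of the population standard deviation are hypothetical choices made for the example.

```python
import statistics


def standard_scale(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def minmax_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical numeric column, not taken from any Lucy dataset.
column = [2.0, 4.0, 6.0, 8.0]
print(standard_scale(column))
print(minmax_scale(column))
```

Standard scaling preserves the shape of the distribution while centering it, whereas min-max scaling bounds every value to the interval from 0 to 1 and is more sensitive to outliers.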

Data Insight Tasks

The main tasks for Explore.

Argument	Detail
-get-schema	NONE. Produces information about each column: count, null count, unique count, and data type. Depending on the data type, it also provides additional information: range, mean, most frequent value, bins, etc.
-feature-importance	NONE. Provides a score for each input feature that describes its relation to the column chosen with ‘-clabel’. There are various measures for this; scores of the same type can be compared, but scores of different types should not be compared to one another.

Explore Examples

Examples to demonstrate how to use Explore arguments when performing an Explore task with a Lucy Build.

Primary

These arguments are the main drivers of an Explore task. Without them, some or all Explore tasks will not be completed in a meaningful way.

Input/Output

An Explore routine begins by specifying the file path from which the input data is read. This is done using the -input argument.

An Explore routine ends by saving the output data. The output file holds the results of the completed task, and its path is set using the -output argument.

C-Label

This is used when a user wishes to specify a target column. It is only needed when using the Feature Importance task.

Preprocessing

There are a few options for manipulating the dataset before using it in an Explore task.

Independents

A subset of columns can be used instead of the entire dataset. In the event that a C-Label is specified, the C-Label column is kept along with the columns selected with independents.

Autoclean

Autoclean is a way to clean a dataset before performing the Explore task. Details of the exact procedures performed by Autoclean can be found at Clean’s Autoclean.

The -autoclean argument is used to clean the dataset before the Explore task is applied:

python bin/lucy.egg plugins ml_explore -input ./input_data.csv -output ./out.json -get-schema -autoclean

Data Insight Tasks

Tasks can be performed individually or together.
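Because each task is a plain flag, a call that runs several tasks at once can be assembled programmatically. The sketch below is a hypothetical helper (build_explore_command is not part of Lucy) that uses only the arguments documented in this section. Passing -independents as one comma-joined value is an assumption based on the COMMA SEPARATED LIST description.

```python
def build_explore_command(input_path, output_path, clabel=None,
                          independents=None, get_schema=False,
                          feature_importance=False):
    """Assemble an ml_explore argument list (hypothetical helper)."""
    if feature_importance and clabel is None:
        # Feature Importance needs a target column, set with -clabel.
        raise ValueError("-feature-importance requires -clabel")
    cmd = ["python", "bin/lucy.egg", "plugins", "ml_explore",
           "-input", input_path, "-output", output_path]
    if clabel is not None:
        cmd += ["-clabel", clabel]
    if independents:
        # Assumption: the list is passed as a single comma-joined value.
        cmd += ["-independents", ",".join(independents)]
    if get_schema:
        cmd.append("-get-schema")
    if feature_importance:
        cmd.append("-feature-importance")
    return cmd


# Run both tasks together on the iris file used later in this section.
cmd = build_explore_command("iris.csv", "iris_output.json", clabel="class",
                            get_schema=True, feature_importance=True)
print(" ".join(cmd))
```

The returned list can be handed to subprocess.run(cmd, check=True) when the command should actually be executed.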

Get Schema

Produces information about each column.

All columns have the following information returned:

id - name of the column

dataType - the type of data found in the column (text or number)

count - the length of the dataset (its number of rows)

null_count - number of empty values found in the column

unique_count - the number of unique values found in a column

If dataType is “text”, the following additional information is provided:

most_frequent - the top 5 most common values in the column along with the number of times that value appeared in the column

If dataType is “number”, the following additional information is provided:

std - the sample standard deviation of the column

mean - the average value of the column

min - the minimum value of the column

25% - the lower quartile of the column

50% - the median of the column

75% - the upper quartile of the column

max - the maximum value of the column
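To make the fields above concrete, the following sketch computes a comparable summary for one column with the Python standard library. It is not Lucy's implementation. The function name column_schema, the use of None for empty values, the exclusion of empty values from unique_count, and the linear-interpolation quartile method are all assumptions made for illustration.

```python
import statistics
from collections import Counter


def column_schema(name, values):
    """Summarize one column; None marks an empty value (assumption)."""
    present = [v for v in values if v is not None]
    is_number = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in present
    )
    schema = {
        "id": name,
        "dataType": "number" if is_number else "text",
        "count": len(values),
        "null_count": len(values) - len(present),
        "unique_count": len(set(present)),
    }
    if is_number:
        # stdev and quantiles both need at least two values.
        if len(present) >= 2:
            q1, q2, q3 = statistics.quantiles(present, n=4, method="inclusive")
            schema.update({
                "std": statistics.stdev(present),  # sample standard deviation
                "mean": statistics.fmean(present),
                "min": min(present),
                "25%": q1,
                "50%": q2,
                "75%": q3,
                "max": max(present),
            })
    else:
        # Top 5 values paired with the number of times each appears.
        schema["most_frequent"] = Counter(present).most_common(5)
    return schema


# Hypothetical columns, not taken from any Lucy dataset.
print(column_schema("width", [1, 2, 3, 4, None]))
print(column_schema("color", ["red", "blue", "red", None]))
```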

Feature Importance

This task measures how the columns of the dataset relate to a specific column. The column of interest is specified using C-Label, so C-Label is required when using this task.

Lucy uses two metrics for measuring Feature Importance.

Correlation - The correlation between each numeric input variable and a numeric target.

Gini Importance - The importance measure of each input variable for a given target.

To obtain these, the option -feature-importance is used with no argument required. These values are populated when specifying a -clabel and are saved in the output file when specifying a path using -output.
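The correlation metric can be pictured with a small sketch. This section does not say which correlation coefficient Lucy computes, so the example assumes the Pearson coefficient. The names pearson and correlation_scores, and the sample data, are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (spread_x * spread_y)


def correlation_scores(features, target):
    """Score every numeric feature column against the numeric target."""
    return {name: pearson(values, target) for name, values in features.items()}


# Hypothetical numeric features and target.
features = {"a": [1.0, 2.0, 3.0, 4.0], "b": [4.0, 3.0, 2.0, 1.0]}
target = [2.0, 4.0, 6.0, 8.0]
print(correlation_scores(features, target))
```

Under this assumption a score near 1 or -1 indicates a strong linear relation with the target, while a score near 0 indicates little linear relation.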

Consider the iris dataset and the following command:

python bin/lucy.egg plugins ml_explore -input iris.csv -output iris_output.json -clabel class -feature-importance

The Lucy-Response is populated with the following scores:

{
    ...
    "payload": {
        ...
        "feature_importance": {
            "gini_importance": {
                "sepal_width": 0.006145984326963746,
                "petal_width": 0.4852350158451978,
                "sepal_length": 0.0059189255008685404,
                "petal_length": 0.50270007432697
            }
        }
    }
}
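One way to use these scores is to rank the features from most to least important. The sketch below parses a response containing only the keys shown above (the elided fields are left out) and sorts the Gini importance values. Embedding the response as a string is done for illustration only, and the variable names are hypothetical.

```python
import json

# Only the keys shown in the response above; elided fields are omitted.
response_text = """
{
    "payload": {
        "feature_importance": {
            "gini_importance": {
                "sepal_width": 0.006145984326963746,
                "petal_width": 0.4852350158451978,
                "sepal_length": 0.0059189255008685404,
                "petal_length": 0.50270007432697
            }
        }
    }
}
"""

response = json.loads(response_text)
scores = response["payload"]["feature_importance"]["gini_importance"]

# Sort by score, highest first. Only scores of the same type are compared.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.4f}")
```

For the iris example, the two petal measurements carry almost all of the importance, while the two sepal measurements contribute very little.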