4.2. Explore

The ML Explore Plugin provides data insights from individual features and how features relate to a target.

python bin/lucy.egg plugins ml_explore -h

Explore Arguments

Primary

Main specifications for an Explore task.

Argument Detail
-input PATH. File path to the input data.
-output PATH. File path where the output summary file will be saved. It is recommended that this file be saved as JSON. However, if ‘-get-schema’ is the only Explore task used, the output can be saved as a CSV.
-clabel STRING. The name of the Target/Response/output column.

Preprocessing

Options for manipulating the dataset before using it.

Argument Detail
-independents COMMA SEPARATED LIST. A subset of column names that will be used for Exploring.
-normalize METHOD. (‘standard’, ‘minmax’) Normalizes the columns of ALL numerical data using the selected scaling method.
-autoclean NONE. Cleans the data based on a pre-defined set of rules before the Explore task runs.
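
As a rough illustration of what the two -normalize methods conventionally mean, the sketch below rescales one numeric column with the textbook formulas. It is not Lucy's implementation: the function names are hypothetical, and the use of the population standard deviation for ‘standard’ is an assumption, since the scaling details are not documented in this section.

```python
import statistics


def standard_scale(values):
    """Rescale values to zero mean and unit spread (z-scores)."""
    mean = statistics.fmean(values)
    # Assumption: population standard deviation; Lucy's choice is not documented.
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]


def minmax_scale(values):
    """Rescale values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


column = [2.0, 4.0, 6.0, 8.0]
print(standard_scale(column))
print(minmax_scale(column))
```

After ‘minmax’ scaling the smallest value maps to 0 and the largest to 1; after ‘standard’ scaling the column has a mean of 0.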

Data Insight Tasks

The main tasks for Explore.

Argument Detail
-get-schema NONE. Produces the information about each column: count, null count, unique count, data type. Depending on the data type provides additional information: range, mean, most frequent value, bins, etc.
-feature-importance NONE. Provides a score for each input feature describing its relation to the column selected with ‘-clabel’. Several measures are reported; scores of the same type can be compared with one another, but scores of different types should not be.

Explore Examples

Examples to demonstrate how to use Explore arguments when performing an Explore task with a Lucy Build.

Primary

These arguments are the main drivers of an Explore task. Without them, some or all Explore tasks will not be completed in a meaningful way.

Input/Output

An Explore routine begins by choosing the file path from which the input data is read. This is done using the -input argument.

An Explore routine ends by saving its results. The output file holds the results of the tasks that were completed, and its path is set using the -output argument.

C-Label

C-Label (the -clabel argument) specifies the target column. It is only needed when using the Feature Importance task.

Preprocessing

There are a few options for manipulating the dataset before using it in an Explore task.

Independents

A subset of columns can be used instead of the entire dataset. In the event that a C-Label is specified, the C-Label column is kept along with the columns selected with independents.
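
The selection rule described above can be sketched as follows. This is a minimal illustration, not Lucy's code: select_columns, the sample column names, and the in-memory CSV are all hypothetical.

```python
import csv
import io


def select_columns(rows, independents, clabel=None):
    """Keep only the chosen columns, plus the C-Label column if one is given."""
    keep = list(independents)
    if clabel is not None and clabel not in keep:
        keep.append(clabel)  # the target column is retained alongside the subset
    return [{name: row[name] for name in keep} for row in rows]


# Hypothetical four-column dataset held in memory instead of a file.
sample = io.StringIO("a,b,c,target\n1,2,3,x\n4,5,6,y\n")
rows = list(csv.DictReader(sample))

subset = select_columns(rows, ["a", "c"], clabel="target")
print(subset[0])
```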

Autoclean

Autoclean cleans a dataset before the Explore task is performed. Details of the exact procedures performed by Autoclean can be found at Clean’s Autoclean.

To enable it, add the -autoclean argument to the command:

python bin/lucy.egg plugins ml_explore -input ./input_data.csv -output ./out.json -get-schema -autoclean

Data Insight Tasks

Tasks can be performed individually or together.
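
Since tasks can be combined, a small helper can assemble one command line from the arguments documented in this section. The sketch below is hypothetical (build_explore_command is not part of Lucy) and assumes the -independents list is written without spaces after the commas. It only builds and prints the command; it does not run Lucy.

```python
import shlex


def build_explore_command(input_path, output_path, clabel=None,
                          independents=None, autoclean=False,
                          get_schema=False, feature_importance=False):
    """Assemble an ml_explore command line as a list of arguments."""
    if feature_importance and clabel is None:
        # Feature Importance needs a target column.
        raise ValueError("-feature-importance requires -clabel")
    cmd = ["python", "bin/lucy.egg", "plugins", "ml_explore",
           "-input", input_path, "-output", output_path]
    if clabel is not None:
        cmd += ["-clabel", clabel]
    if independents:
        # Assumption: comma-separated with no spaces.
        cmd += ["-independents", ",".join(independents)]
    if autoclean:
        cmd.append("-autoclean")
    if get_schema:
        cmd.append("-get-schema")
    if feature_importance:
        cmd.append("-feature-importance")
    return cmd


cmd = build_explore_command("iris.csv", "iris_output.json", clabel="class",
                            get_schema=True, feature_importance=True)
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) in an environment where Lucy is installed.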

Get Schema

Produces information about each column.

All columns have the following information returned:

  • id - name of the column
  • dataType - the type of data found in the column (text or number)
  • count - the length of the dataset
  • null_count - number of empty values found in the column
  • unique_count - the number of unique values found in a column

If dataType is “text”, the following additional information is provided:

  • most_frequent - the top 5 most common values in the column along with the number of times that value appeared in the column

If dataType is “number”, the following additional information is provided:

  • std - the sample standard deviation of the column
  • mean - the average value of the column
  • min - the minimum value of the column
  • 25% - the lower quartile of the column
  • 50% - the median of the column
  • 75% - the upper quartile of the column
  • max - the maximum value of the column
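
To make the fields above concrete, the following sketch computes comparable statistics for a single column using only the Python standard library. It is an approximation under stated assumptions, not Lucy's implementation: describe_column is a hypothetical name, None and empty strings are treated as nulls, nulls are excluded from unique_count, and quartiles use linear interpolation.

```python
import statistics
from collections import Counter


def describe_column(name, values):
    """Summarise one column in the spirit of -get-schema (hypothetical helper)."""
    # Assumption: None and empty strings are the "empty values".
    present = [v for v in values if v is not None and v != ""]
    is_number = bool(present) and all(
        isinstance(v, (int, float)) for v in present
    )
    info = {
        "id": name,
        "dataType": "number" if is_number else "text",
        "count": len(values),                  # length of the dataset
        "null_count": len(values) - len(present),
        "unique_count": len(set(present)),     # assumption: nulls excluded
    }
    if is_number and len(present) >= 2:
        # Quartiles by linear interpolation between data points.
        q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
        info.update({
            "std": statistics.stdev(present),  # sample standard deviation
            "mean": statistics.fmean(present),
            "min": min(present),
            "25%": q1,
            "50%": median,
            "75%": q3,
            "max": max(present),
        })
    elif not is_number:
        # Top 5 values, each paired with its number of occurrences.
        info["most_frequent"] = Counter(present).most_common(5)
    return info


print(describe_column("petal_length", [1.4, 1.3, 4.7, 5.1, None]))
print(describe_column("class", ["setosa", "setosa", "virginica", ""]))
```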

Feature Importance

This task measures how strongly each column of the dataset relates to a specific column of interest. That column is specified using C-Label, so C-Label is required when using this task.

Lucy uses two metrics for measuring Feature Importance.

  • Correlation - the correlation between each numeric input variable and a numeric target.
  • Gini Importance - the importance measure of each input variable for a given target.

To obtain these, the option -feature-importance is used with no argument required. These values are populated when specifying a -clabel and are saved in the output file when specifying a path using -output.
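
This section does not state which correlation coefficient Lucy reports. As an assumption for illustration only, the sketch below computes the Pearson correlation between one numeric feature and a numeric target; pearson_correlation is a hypothetical helper, not part of Lucy, and the sample values are made up.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either column is constant
    return cov / math.sqrt(var_x * var_y)


# Made-up feature and target values, for illustration only.
feature = [1.0, 2.0, 3.0, 4.0]
target = [2.1, 3.9, 6.2, 7.8]
print(pearson_correlation(feature, target))
```

A value near 1 or -1 indicates a strong linear relationship; a value near 0 indicates little linear relationship.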

Consider the iris dataset and the following command:

python bin/lucy.egg plugins ml_explore -input iris.csv -output iris_output.json -clabel class -feature-importance

The Lucy-Response is populated with the following scores:

{
    ...
    "payload": {
        ...
        "feature_importance": {
            "gini_importance": {
                "sepal_width": 0.006145984326963746,
                "petal_width": 0.4852350158451978,
                "sepal_length": 0.0059189255008685404,
                "petal_length": 0.50270007432697
            }
        }
    }
}
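
Once the scores are available, they can be ranked with a few lines of Python. The sketch below embeds the gini_importance values shown above (with the elided fields omitted) rather than reading a file; whether the saved output file has exactly this layout is an assumption.

```python
import json

# Values copied from the response above; elided fields are left out.
response_text = """
{
    "payload": {
        "feature_importance": {
            "gini_importance": {
                "sepal_width": 0.006145984326963746,
                "petal_width": 0.4852350158451978,
                "sepal_length": 0.0059189255008685404,
                "petal_length": 0.50270007432697
            }
        }
    }
}
"""

response = json.loads(response_text)
scores = response["payload"]["feature_importance"]["gini_importance"]

# Sort features from most to least important.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name:<14}{score:.4f}")
```

For this example the two petal measurements account for almost all of the importance, and the four scores sum to approximately 1.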