.. _auto_dataset_generate_synthetic_data:

*GENERATE SYNTHETIC DATA (DENSIFY DATASET)*
===========================================

Adds synthetic rows to a dataset by interpolating between existing rows in a user-selected feature space. When a prediction-error column is supplied, anchors are sampled with probability proportional to the (transformed) absolute error, so new points cluster around regions where a surrogate or ML model is least accurate (active learning / adaptive sampling). When no error column is supplied, anchors are sampled by mean nearest-neighbour distance, so new points fill gaps in the design space (space-filling).

Distances are Euclidean over the selected feature columns, optionally min-max normalised so that columns with different physical units contribute comparably. Each synthetic row is a linear blend of an anchor and one of its top-k nearest neighbours; non-feature numeric columns are also linearly blended, while non-numeric columns are copied from the anchor.

When to use
-----------

Tagged: ``active_learning``, ``adaptive_sampling``, ``data_augmentation``, ``doe``, ``euclidean``, ``nearest_neighbor``, ``prediction_error``, ``space_filling``.

Inputs
------

.. list-table::
   :header-rows: 1
   :widths: 20 20 20 20 20 20

   * - Label
     - ID
     - Type
     - Default
     - Required
     - Description
   * - Dataset
     - dataset
     - dataset
     - (none)
     - ✓
     - Input dataset to densify; must contain every column listed in ``feature_columns`` and, if set, the column named in ``prediction_error_column``. Typical inputs: DOE results, ML training tables.
   * - Feature Columns
     - feature_columns
     - text
     - (none)
     - ✓
     - Comma-separated list of numeric column names that define the feature space. Distances and interpolation operate exclusively over these columns. Example: 'EMOD, TENMAX, GCTEN'.
   * - Prediction Error Column (optional)
     - prediction_error_column
     - text
     - (none)
     -
     - Name of the column containing the prediction error or goodness score. When provided, the worker runs in error-driven mode (more synthetic neighbours around the worst rows). When blank, it runs in space-filling mode (more synthetic neighbours in sparse regions of the feature space). Whether high or low values mean 'worse' is controlled by ``error_polarity``: it defaults to ``higher_is_worse`` for residuals / RMSE; switch to ``lower_is_worse`` for accuracy / R^2. Only the absolute value is read.
   * - Mode
     - mode
     - select
     - auto
     -
     - Anchor-selection strategy. ``auto`` picks ``error_driven`` when ``prediction_error_column`` is set and ``space_filling`` otherwise. Set explicitly to override.
   * - Synthesis Method
     - method
     - select
     - neighbor_interpolation
     -
     - How synthetic rows are generated. Only ``neighbor_interpolation`` is currently implemented; this input is reserved for future synthesis algorithms (perturbation, SMOTE, copula sampling, generative models).
   * - Number of New Points
     - n_new_points
     - text
     - 10
     -
     - How many synthetic rows to add. The output dataset has (input_rows + n_new_points) rows.
   * - Top-k Nearest Neighbours
     - k_neighbors
     - text
     - 1
     -
     - When generating each synthetic row, the worker picks an anchor i and a partner j from the top-k Euclidean nearest neighbours of i. k=1 always blends with the single closest neighbour (most conservative); higher k spreads new points across more directions.
   * - Error Weighting
     - error_weighting
     - select
     - linear
     -
     - Function applied to the badness score (``|error|`` when ``error_polarity=higher_is_worse``, or ``max|error| - |error|`` when ``lower_is_worse``) to obtain anchor sampling weights. ``none``: uniform (ignores magnitude); ``linear``: weight ~ badness; ``squared``: weight ~ badness^2 (concentrates points on worst-fit rows); ``exp``: weight ~ exp(badness) - 1 (very aggressive). Ignored when ``prediction_error_column`` is empty.
   * - Error Polarity
     - error_polarity
     - select
     - higher_is_worse
     -
     - How to interpret the values in ``prediction_error_column``. ``higher_is_worse`` (default) treats larger absolute values as worse predictions, so anchors are sampled toward those rows (classic active learning on residuals / RMSE). ``lower_is_worse`` treats smaller values as worse, which is useful when the column carries a goodness metric such as accuracy or R^2, where small numbers mean a poor fit; the worker inverts the score so anchors cluster around the low-value rows. Ignored when ``prediction_error_column`` is empty.
   * - Interpolation Method
     - interpolation_method
     - select
     - midpoint
     -
     - How a synthetic row is positioned along the segment from anchor i to neighbour j. ``midpoint`` is deterministic (t=0.5). ``random`` draws t uniformly from [0.2, 0.8] (avoids the endpoints).
   * - Normalise Features
     - normalize_features
     - select
     - yes
     -
     - Whether to min-max normalise each feature column to [0, 1] before computing Euclidean distances. Recommended (``yes``) when feature columns are in different physical units or scales. Has no effect on the interpolated output values, which are always emitted in the original (unnormalised) units.
   * - Mark Synthetic Rows
     - mark_synthetic
     - select
     - yes
     -
     - When ``yes``, the augmented dataset gains four bookkeeping columns: ``is_synthetic`` (1 for new, 0 for original), ``anchor_row_id``, ``neighbor_row_id``, ``interp_t``.
   * - Random Seed (optional)
     - random_seed
     - text
     - (none)
     -
     - Optional integer seed for the RNG that drives anchor / neighbour sampling. Set this to make the output reproducible across runs. Leave blank for a non-deterministic seed.

Outputs
-------

.. list-table::
   :header-rows: 1
   :widths: 20 20 20 20

   * - Label
     - ID
     - Type
     - Description
   * - Augmented Dataset
     - dataset
     - dataset
     - Original rows followed by the synthetic rows. When 'Mark Synthetic Rows' is ``yes``, every row carries the ``is_synthetic``, ``anchor_row_id``, ``neighbor_row_id`` and ``interp_t`` bookkeeping columns.
   * - Synthetic Rows Only
     - new_rows
     - dataset
     - The newly generated rows by themselves; convenient for piping to a simulation submission worker or to a Reporter.
   * - Summary
     - summary
     - string
     - Plain-text status: mode chosen, row counts, mean nearest-neighbour distance before densification, and error statistics when applicable.

Disciplines
-----------

- ai_ml.preprocessing
- data.dataset.transform
- design_exploration.doe

Runnable example
----------------

A runnable example is registered for this worker. Open the example workflow on the d3VIEW canvas: ``/api/workflow/example?id=dataset_generate_synthetic_data``

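
Illustrative sketch
-------------------

Separately from the registered workflow, the anchor-sampling and blending rules described on this page can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the worker's implementation: all function names are hypothetical, the space-filling weight is assumed to be the mean distance to the top-k neighbours, the partner j is assumed to be drawn uniformly from the top-k neighbours, t=0 is assumed to correspond to the anchor, and sampling is assumed to fall back to uniform when every weight is zero. It handles feature columns only and needs at least two rows.

```python
import math
import random


def min_max_normalise(rows):
    """Scale each feature column to [0, 1]. Constant columns map to 0 (assumption)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in rows]


def nearest_neighbours(points, i, k):
    """Indices of the k rows closest to row i by Euclidean distance."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def badness(errors, polarity="higher_is_worse"):
    """|error|, or max|error| - |error| when low values mean a poor fit."""
    abs_err = [abs(e) for e in errors]
    if polarity == "lower_is_worse":
        top = max(abs_err)
        return [top - a for a in abs_err]
    return abs_err


def anchor_weights(scores, weighting="linear"):
    """Map badness scores to anchor sampling weights."""
    transforms = {
        "none": lambda s: 1.0,
        "linear": lambda s: s,
        "squared": lambda s: s * s,
        "exp": math.expm1,  # exp(s) - 1
    }
    return [transforms[weighting](s) for s in scores]


def space_filling_weights(points, k):
    """Assumed reading of the space-filling rule: mean distance to the top-k neighbours."""
    weights = []
    for i in range(len(points)):
        nbrs = nearest_neighbours(points, i, k)
        weights.append(sum(math.dist(points[i], points[j]) for j in nbrs) / len(nbrs))
    return weights


def densify(features, errors=None, n_new_points=10, k_neighbors=1,
            weighting="linear", polarity="higher_is_worse",
            interpolation="midpoint", normalise=True, seed=None):
    """Return synthetic rows as dicts carrying the bookkeeping fields."""
    rng = random.Random(seed)
    # Distances use the (optionally) normalised space; blending uses original units.
    space = min_max_normalise(features) if normalise else features
    if errors is None:
        weights = space_filling_weights(space, k_neighbors)
    else:
        weights = anchor_weights(badness(errors, polarity), weighting)
    if sum(weights) <= 0:  # assumption: fall back to uniform sampling
        weights = [1.0] * len(features)
    new_rows = []
    for _ in range(n_new_points):
        i = rng.choices(range(len(features)), weights=weights)[0]
        j = rng.choice(nearest_neighbours(space, i, k_neighbors))
        t = 0.5 if interpolation == "midpoint" else rng.uniform(0.2, 0.8)
        # t = 0 is the anchor, t = 1 the neighbour (assumption).
        values = [(1 - t) * a + t * b for a, b in zip(features[i], features[j])]
        new_rows.append({"values": values, "anchor_row_id": i,
                         "neighbor_row_id": j, "interp_t": t})
    return new_rows
```

For example, ``densify([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]], [0.0, 0.0, 5.0], n_new_points=3)`` always anchors on the third row, because it is the only row with a non-zero error, and blends it with its nearest neighbour in normalised space (the second row), giving ``[5.5, 5.0]`` for every synthetic point.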

Auto-generated from platform schema. Worker id: dataset_generate_synthetic_data. Schema hash: d93b5734e8e4. Hand-curated docs in workerexamples/ override this page when present.