GENERATE SYNTHETIC DATA (DENSIFY DATASET)

Adds synthetic rows to a dataset by interpolating between existing rows in a user-selected feature space. When a prediction-error column is supplied, anchors are sampled with probability proportional to the (transformed) absolute error, so new points cluster around regions where a surrogate or ML model is least accurate (active-learning / adaptive sampling). When no error column is supplied, anchors are weighted by their mean nearest-neighbour distance, so new points fill gaps in the design space (space-filling). Distances are Euclidean over the selected feature columns, optionally min-max normalised so that columns with different physical units contribute comparably. Each synthetic row is a linear blend of an anchor and one of its top-k nearest neighbours; non-feature numeric columns are also linearly blended, while non-numeric columns are copied from the anchor.
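
The blending step described above can be illustrated with a minimal sketch. It is not the worker's actual implementation: the function names (min_max_normalise, top_k_neighbours, synthesise) and the tuple layout of the result are hypothetical, and the anchor weights are passed in directly rather than derived from an error column. It assumes at least two rows and purely numeric feature values.

```python
import math
import random


def min_max_normalise(points):
    """Scale each feature column to [0, 1]; a constant column maps to 0."""
    cols = list(zip(*points))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid division by zero
    return [[(v - lo) / sp for v, lo, sp in zip(p, lows, spans)] for p in points]


def top_k_neighbours(scaled, i, k):
    """Indices of the k rows closest to row i (Euclidean), excluding i itself."""
    others = [j for j in range(len(scaled)) if j != i]
    others.sort(key=lambda j: math.dist(scaled[i], scaled[j]))
    return others[:k]


def synthesise(points, weights, n_new, k=1, t=0.5, seed=None):
    """Return (anchor_index, neighbour_index, blended_point) tuples.

    Distances use the normalised copy; blended values stay in original units.
    """
    rng = random.Random(seed)
    scaled = min_max_normalise(points)
    new_rows = []
    for _ in range(n_new):
        i = rng.choices(range(len(points)), weights=weights)[0]  # pick anchor
        j = rng.choice(top_k_neighbours(scaled, i, k))           # pick partner
        blended = [(1 - t) * a + t * b for a, b in zip(points[i], points[j])]
        new_rows.append((i, j, blended))
    return new_rows


# Three rows whose two features live on very different scales.
points = [[0.0, 0.0], [1.0, 100.0], [10.0, 1000.0]]
# All sampling weight on row 2, so it is always the anchor.
rows = synthesise(points, weights=[0, 0, 1], n_new=1, k=1)
print(rows)  # [(2, 1, [5.5, 550.0])]
```

With k=1 and the midpoint setting (t=0.5), the new point lands halfway between the anchor and its single closest neighbour; raising k lets the partner vary among several nearby rows.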

When to use

Tagged: active_learning, adaptive_sampling, data_augmentation, doe, euclidean, nearest_neighbor, prediction_error, space_filling.

Inputs

Label ID Type Default Required Description
Dataset dataset dataset Input dataset to densify; must contain every column listed in feature_columns and (optionally) prediction_error_column. Typical inputs: DOE results, ML training tables.
Feature Columns feature_columns text Comma-separated list of numeric column names that define the feature space. Distances and interpolation operate exclusively over these columns. Example: ‘EMOD, TENMAX, GCTEN’.
Prediction Error Column (optional) prediction_error_column text   Name of the column containing the prediction error or goodness score. When provided, the worker runs in error-driven mode (more synthetic neighbours around the worst rows). When blank, it runs in space-filling mode (more synthetic neighbours in sparse regions of the feature space). The polarity (whether HIGH or LOW values mean ‘worse’) is controlled by error_polarity, which defaults to higher_is_worse (suitable for residuals / RMSE); switch to lower_is_worse for accuracy / R^2. Only the absolute value is read.
Mode mode select auto   Anchor-selection strategy. ‘auto’ picks error_driven when prediction_error_column is set, space_filling otherwise. Set explicitly to override.
Synthesis Method method select neighbor_interpolation   How synthetic rows are generated. Today only ‘neighbor_interpolation’ is implemented; this input is forward-looking for future synthesis algorithms (perturbation, SMOTE, copula sampling, generative models).
Number of New Points n_new_points text   How many synthetic rows to add. Default 10. The output dataset has (input_rows + n_new_points) rows.
Top-k Nearest Neighbours k_neighbors text   When generating each synthetic row, the worker picks an anchor i and a partner j from the top-k Euclidean nearest neighbours of i. k=1 always blends with the single closest neighbour (most conservative); higher k spreads new points across more directions. Default 1.
Error Weighting error_weighting select linear   Function applied to the badness score (|error| when error_polarity=higher_is_worse, or max|error| - |error| when lower_is_worse) to obtain anchor sampling weights. ‘none’ = uniform (ignores magnitude); ‘linear’ = weight ~ badness; ‘squared’ = weight ~ badness^2 (concentrates points on worst-fit rows); ‘exp’ = weight ~ exp(badness) - 1 (very aggressive). Ignored when prediction_error_column is empty.
Error Polarity error_polarity select higher_is_worse   How to interpret the values in prediction_error_column. ‘higher_is_worse’ (default) treats larger absolute values as worse predictions — anchors are sampled toward those rows (classic active learning on residuals / RMSE). ‘lower_is_worse’ treats smaller values as worse — useful when the column actually carries a goodness metric like accuracy or R^2 where small numbers mean a poor fit; the worker inverts the score so anchors cluster around the LOW-value rows. Ignored when prediction_error_column is empty.
Interpolation Method interpolation_method select midpoint   How a synthetic row is positioned along the segment from anchor i to neighbour j. ‘midpoint’ is deterministic (t=0.5). ‘random’ draws t uniformly from [0.2, 0.8] (avoids the endpoints).
Normalise Features normalize_features select yes   Whether to min-max normalise each feature column to [0, 1] before computing Euclidean distances. Recommended (yes) when feature columns are in different physical units / scales. Has no effect on the interpolated output values, which are always emitted in the original (unnormalised) units.
Mark Synthetic Rows mark_synthetic select yes   When ‘yes’ the augmented dataset gains four bookkeeping columns: is_synthetic (1 for new, 0 for original), anchor_row_id, neighbor_row_id, interp_t.
Random Seed (optional) random_seed text   Optional integer seed for the RNG that drives anchor / neighbour sampling. Set this to make the output reproducible across runs. Leave blank for a non-deterministic seed.
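
As an illustration of how error_weighting, error_polarity and interpolation_method interact, the following sketch turns an error column into anchor sampling probabilities and draws the blend factor t. The function names (anchor_weights, interpolation_t), the normalisation of the weights to sum to 1, and the uniform fallback when every badness score is zero are assumptions made for this example, not documented worker behaviour.

```python
import math
import random


def anchor_weights(errors, weighting="linear", polarity="higher_is_worse"):
    """Map an error column to anchor sampling probabilities (sketch)."""
    abs_err = [abs(e) for e in errors]  # only the absolute value is read
    if polarity == "lower_is_worse":
        top = max(abs_err)
        badness = [top - a for a in abs_err]  # invert: small values are worst
    else:
        badness = abs_err
    transforms = {
        "none": lambda b: 1.0,        # uniform, ignores magnitude
        "linear": lambda b: b,
        "squared": lambda b: b * b,
        "exp": lambda b: math.expm1(b),  # exp(b) - 1
    }
    raw = [transforms[weighting](b) for b in badness]
    total = sum(raw)
    if total == 0:
        # Assumption: fall back to uniform sampling when no row stands out.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def interpolation_t(method, rng):
    """'midpoint' gives t = 0.5; 'random' draws t uniformly from [0.2, 0.8]."""
    return 0.5 if method == "midpoint" else rng.uniform(0.2, 0.8)


errors = [-1.0, 2.0, 1.0]
print(anchor_weights(errors))                             # [0.25, 0.5, 0.25]
print(anchor_weights(errors, polarity="lower_is_worse"))  # [0.5, 0.0, 0.5]
print(interpolation_t("midpoint", random.Random(42)))     # 0.5
```

Note that with lower_is_worse the row holding the largest absolute value gets zero weight under this reading, so it is never chosen as an anchor.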

Outputs

Label ID Type Description
Augmented Dataset dataset dataset Original rows followed by the synthetic rows. When ‘Mark Synthetic Rows’ is yes, every row carries is_synthetic / anchor_row_id / neighbor_row_id / interp_t bookkeeping columns.
Synthetic Rows Only new_rows dataset The newly generated rows on their own; convenient for piping to a simulation submission worker or to a Reporter.
Summary summary string Plain-text status: mode chosen, row counts, mean nearest-neighbour distance before densification, error statistics when applicable.
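
The Summary reports a mean nearest-neighbour distance. One plausible reading of that statistic, assumed here rather than confirmed by the schema, is the average over all rows of the Euclidean distance to each row's closest other row. The helper name mean_nn_distance is hypothetical, and the sketch works on raw (unnormalised) values with at least two rows.

```python
import math


def mean_nn_distance(points):
    """Average, over all rows, of the distance to the closest other row."""
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    return sum(nearest) / len(nearest)


points = [[0.0, 0.0], [3.0, 4.0], [3.0, 5.0]]
before = mean_nn_distance(points)

# Add the midpoint between row 0 and its nearest neighbour, row 1.
densified = points + [[1.5, 2.0]]
after = mean_nn_distance(densified)

print(round(before, 3), round(after, 3))  # 2.333 1.75
```

Filling the gap around the isolated row lowers the statistic, which is the effect space-filling mode aims for.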

Disciplines

  • ai_ml.preprocessing
  • data.dataset.transform
  • design_exploration.doe

Runnable example

A runnable example is registered for this worker. Open the example workflow on the d3VIEW canvas: /api/workflow/example?id=dataset_generate_synthetic_data


Auto-generated from platform schema. Worker id: dataset_generate_synthetic_data. Schema hash: d93b5734e8e4. Hand-curated docs in workerexamples/ override this page when present.