GENERATE SYNTHETIC DATA (DENSIFY DATASET)
Adds synthetic rows to a dataset by interpolating between existing rows in a user-selected feature space. When a prediction-error column is supplied, anchors are sampled with probability proportional to the (transformed) absolute error so new points cluster around regions where a surrogate or ML model is least accurate (active-learning / adaptive sampling). When no error column is supplied, anchors are sampled by mean nearest-neighbour distance so new points fill gaps in the design space (space-filling). Distances are Euclidean over the selected feature columns, optionally min-max normalised so columns with different physical units contribute comparably. Each synthetic row is a linear blend of an anchor and one of its top-k nearest neighbours; non-feature numeric columns are also linearly blended, while non-numeric columns are copied from the anchor.
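The space-filling variant of the description above can be sketched in a few lines of plain Python. This is a minimal illustration, not the worker's implementation: the function names (`min_max_normalise`, `nearest_neighbours`, `densify`) are hypothetical, and the sketch assumes that "sampled by mean nearest-neighbour distance" means an anchor is drawn with probability proportional to its mean distance to its top-k neighbours. It uses the deterministic midpoint blend (t = 0.5) and only handles numeric feature columns.

```python
import math
import random


def min_max_normalise(points):
    """Scale every column to [0, 1]; a constant column maps to all zeros."""
    cols = list(zip(*points))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / s for v, lo, s in zip(p, lows, spans)] for p in points]


def nearest_neighbours(scaled, i, k):
    """Indices of the k rows closest to row i (Euclidean), nearest first."""
    others = [j for j in range(len(scaled)) if j != i]
    others.sort(key=lambda j: math.dist(scaled[i], scaled[j]))
    return others[:k]


def densify(points, n_new, k=1, seed=None):
    """Return n_new synthetic rows, each the midpoint of an anchor and a neighbour."""
    rng = random.Random(seed)
    scaled = min_max_normalise(points)  # used for distances only
    neighbours = [nearest_neighbours(scaled, i, k) for i in range(len(points))]
    # Assumption: anchor weight = mean distance to the top-k neighbours,
    # so isolated rows are picked more often.
    weights = [
        sum(math.dist(scaled[i], scaled[j]) for j in nbrs) / len(nbrs)
        for i, nbrs in enumerate(neighbours)
    ]
    new_rows = []
    for _ in range(n_new):
        i = rng.choices(range(len(points)), weights=weights)[0]
        j = rng.choice(neighbours[i])
        t = 0.5  # midpoint interpolation
        # Blend in the original (unnormalised) units.
        new_rows.append([a + t * (b - a) for a, b in zip(points[i], points[j])])
    return new_rows
```

Normalisation affects only which rows count as neighbours and how anchors are weighted; the blended values are computed from the original coordinates, matching the behaviour described for the Normalise Features input.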
When to use
Tagged: active_learning, adaptive_sampling, data_augmentation, doe, euclidean, nearest_neighbor, prediction_error, space_filling.
Inputs
| Label | ID | Type | Default | Required | Description |
|---|---|---|---|---|---|
| Dataset | dataset | dataset | — | ✓ | Input dataset to densify; must contain every column listed in feature_columns and (optionally) prediction_error_column. Typical inputs: DOE results, ML training tables. |
| Feature Columns | feature_columns | text | — | ✓ | Comma-separated list of numeric column names that define the feature space. Distances and interpolation operate exclusively over these columns. Example: ‘EMOD, TENMAX, GCTEN’. |
| Prediction Error Column (optional) | prediction_error_column | text | — | | Name of the column containing the prediction error or goodness score. When provided, the worker runs in error-driven mode (more synthetic neighbours around the worst rows). When blank, it runs in space-filling mode (more synthetic neighbours in sparse regions of the feature space). The polarity (whether HIGH or LOW values mean ‘worse’) is controlled by error_polarity, which defaults to higher_is_worse for residuals / RMSE; switch to lower_is_worse for accuracy / R^2. Only the absolute value is read. |
| Mode | mode | select | auto | | Anchor-selection strategy. ‘auto’ picks error_driven when prediction_error_column is set, space_filling otherwise. Set explicitly to override. |
| Synthesis Method | method | select | neighbor_interpolation | | How synthetic rows are generated. Today only ‘neighbor_interpolation’ is implemented; this input is forward-looking for future synthesis algorithms (perturbation, SMOTE, copula sampling, generative models). |
| Number of New Points | n_new_points | text | 10 | | How many synthetic rows to add. The output dataset has (input_rows + n_new_points) rows. |
| Top-k Nearest Neighbours | k_neighbors | text | 1 | | When generating each synthetic row, the worker picks an anchor i and a partner j from the top-k Euclidean nearest neighbours of i. k=1 always blends with the single closest neighbour (most conservative); higher k spreads new points across more directions. |
| Error Weighting | error_weighting | select | linear | | Function applied to the badness score (abs(error) when error_polarity=higher_is_worse, or max(abs(error)) - abs(error) when lower_is_worse) to obtain anchor sampling weights. ‘none’ = uniform (ignores magnitude); ‘linear’ = weight ~ badness; ‘squared’ = weight ~ badness^2 (concentrates points on worst-fit rows); ‘exp’ = weight ~ exp(badness) - 1 (very aggressive). Ignored when prediction_error_column is empty. |
| Error Polarity | error_polarity | select | higher_is_worse | | How to interpret the values in prediction_error_column. ‘higher_is_worse’ (default) treats larger absolute values as worse predictions, so anchors are sampled toward those rows (classic active learning on residuals / RMSE). ‘lower_is_worse’ treats smaller values as worse, which is useful when the column carries a goodness metric such as accuracy or R^2 where small numbers mean a poor fit; the worker inverts the score so anchors cluster around the LOW-value rows. Ignored when prediction_error_column is empty. |
| Interpolation Method | interpolation_method | select | midpoint | | How a synthetic row is positioned along the segment from anchor i to neighbour j. ‘midpoint’ is deterministic (t=0.5). ‘random’ draws t uniformly from [0.2, 0.8] (avoids the endpoints). |
| Normalise Features | normalize_features | select | yes | | Whether to min-max normalise each feature column to [0, 1] before computing Euclidean distances. Recommended (yes) when feature columns are in different physical units / scales. Has no effect on the interpolated output values, which are always emitted in the original (unnormalised) units. |
| Mark Synthetic Rows | mark_synthetic | select | yes | | When ‘yes’, the augmented dataset gains four bookkeeping columns: is_synthetic (1 for new, 0 for original), anchor_row_id, neighbor_row_id, interp_t. |
| Random Seed (optional) | random_seed | text | — | | Optional integer seed for the RNG that drives anchor / neighbour sampling. Set this to make the output reproducible across runs. Leave blank for a non-deterministic seed. |
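
The Error Weighting, Error Polarity and Interpolation Method options in the table above map onto small formulas. The sketch below restates them in Python as an illustration only; `anchor_weights` and `interpolation_t` are hypothetical helper names, not part of the worker's API, and the weights are left unnormalised on the assumption that the sampler only needs relative magnitudes.

```python
import math
import random


def anchor_weights(errors, weighting="linear", polarity="higher_is_worse"):
    """Turn an error column into (unnormalised) anchor sampling weights."""
    abs_err = [abs(e) for e in errors]  # only the absolute value is read
    if polarity == "lower_is_worse":
        top = max(abs_err)
        badness = [top - a for a in abs_err]  # invert: small values become bad
    else:
        badness = abs_err
    if weighting == "none":
        return [1.0] * len(badness)  # uniform, ignores magnitude
    if weighting == "linear":
        return [float(b) for b in badness]
    if weighting == "squared":
        return [float(b * b) for b in badness]
    if weighting == "exp":
        return [math.expm1(b) for b in badness]  # exp(badness) - 1
    raise ValueError(f"unknown weighting: {weighting!r}")


def interpolation_t(method, rng):
    """Position along the anchor-to-neighbour segment."""
    if method == "midpoint":
        return 0.5
    if method == "random":
        return rng.uniform(0.2, 0.8)  # keeps clear of both endpoints
    raise ValueError(f"unknown interpolation method: {method!r}")
```

With `lower_is_worse`, the row holding the largest absolute value receives weight zero under every scheme except ‘none’, so in this sketch it is never chosen as an anchor.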
Outputs
| Label | ID | Type | Description |
|---|---|---|---|
| Augmented Dataset | dataset | dataset | Original rows followed by the synthetic rows. When ‘Mark Synthetic Rows’ is yes, every row carries is_synthetic / anchor_row_id / neighbor_row_id / interp_t bookkeeping columns. |
| Synthetic Rows Only | new_rows | dataset | The newly-generated rows by themselves - convenient for piping to a simulation submission worker or to a Reporter. |
| Summary | summary | string | Plain-text status: mode chosen, row counts, mean nearest-neighbour distance before densification, error statistics when applicable. |
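
When Mark Synthetic Rows is enabled, the bookkeeping columns make it possible to separate and audit the new rows downstream. The sketch below is illustrative only: the rows are hypothetical toy data using the EMOD and TENMAX column names from the Feature Columns example, and it assumes that anchor_row_id and neighbor_row_id are 0-based positions in the augmented dataset and are empty (None) for original rows, neither of which this page specifies.

```python
def split_rows(rows):
    """Separate original rows from synthetic ones via the is_synthetic flag."""
    originals = [r for r in rows if r["is_synthetic"] == 0]
    synthetic = [r for r in rows if r["is_synthetic"] == 1]
    return originals, synthetic


def expected_blend(rows, synthetic_row, column):
    """Recompute anchor + t * (neighbour - anchor) for one numeric column."""
    a = rows[synthetic_row["anchor_row_id"]][column]  # assumed 0-based index
    b = rows[synthetic_row["neighbor_row_id"]][column]
    t = synthetic_row["interp_t"]
    return a + t * (b - a)


# Hypothetical augmented dataset: two original rows and one synthetic midpoint.
rows = [
    {"EMOD": 200.0, "TENMAX": 1.0, "is_synthetic": 0,
     "anchor_row_id": None, "neighbor_row_id": None, "interp_t": None},
    {"EMOD": 210.0, "TENMAX": 3.0, "is_synthetic": 0,
     "anchor_row_id": None, "neighbor_row_id": None, "interp_t": None},
    {"EMOD": 205.0, "TENMAX": 2.0, "is_synthetic": 1,
     "anchor_row_id": 0, "neighbor_row_id": 1, "interp_t": 0.5},
]

originals, synthetic = split_rows(rows)
for row in synthetic:
    # Each stored value should match the blend implied by its bookkeeping.
    for column in ("EMOD", "TENMAX"):
        assert abs(row[column] - expected_blend(rows, row, column)) < 1e-9
```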
Disciplines
- ai_ml.preprocessing
- data.dataset.transform
- design_exploration.doe
Runnable example
A runnable example is registered for this worker. Open the example workflow on the d3VIEW canvas: /api/workflow/example?id=dataset_generate_synthetic_data
Auto-generated from platform schema. Worker id: dataset_generate_synthetic_data. Schema hash: d93b5734e8e4. Hand-curated docs in workerexamples/ override this page when present.