GENERATE SYNTHETIC DATA (DENSIFY DATASET)

Adds synthetic rows to a dataset by interpolating between existing rows in a user-selected feature space. When a prediction-error column is supplied, anchors are sampled with probability proportional to the (transformed) absolute error so new points cluster around regions where a surrogate or ML model is least accurate (active-learning / adaptive sampling). When no error column is supplied, anchors are sampled by mean nearest-neighbour distance so new points fill gaps in the design space (space-filling). Distances are Euclidean over the selected feature columns, optionally min-max normalised so columns with different physical units contribute comparably. Each synthetic row is a linear blend of an anchor and one of its top-k nearest neighbours; non-feature numeric columns are also linearly blended, while non-numeric columns are copied from the anchor.
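
The space-filling path described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the worker's actual implementation: the function names `min_max_normalise` and `densify_space_filling` are hypothetical, the nearest-neighbour search is brute force, and the anchor weight is assumed to be the mean distance to the top-k neighbours.

```python
import math
import random

def min_max_normalise(rows):
    """Scale each column to [0, 1]; constant columns map to 0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    span = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - l) / s for v, l, s in zip(r, lo, span)] for r in rows]

def densify_space_filling(rows, n_new_points=10, k_neighbors=1,
                          normalize=True, interpolation="midpoint", seed=None):
    """Hypothetical sketch; returns a list of (new_row, anchor, neighbour, t)."""
    rng = random.Random(seed)
    z = min_max_normalise(rows) if normalize else rows
    n = len(rows)
    k = min(k_neighbors, n - 1)
    # Top-k Euclidean nearest neighbours of every row (brute force).
    knn = []
    for i in range(n):
        others = sorted((math.dist(z[i], z[j]), j) for j in range(n) if j != i)
        knn.append(others[:k])
    # ASSUMPTION: weight = mean distance to the top-k neighbours, so isolated
    # rows are more likely to be drawn as anchors.
    weights = [sum(d for d, _ in nb) / k for nb in knn]
    out = []
    for _ in range(n_new_points):
        i = rng.choices(range(n), weights=weights)[0]
        j = rng.choice(knn[i])[1]
        t = 0.5 if interpolation == "midpoint" else rng.uniform(0.2, 0.8)
        # Distances use normalised values; the blend uses the original units.
        new_row = [a + t * (b - a) for a, b in zip(rows[i], rows[j])]
        out.append((new_row, i, j, t))
    return out

# Illustrative feature rows only (two columns on very different scales).
rows = [[0.0, 0.0], [1.0, 100.0], [2.0, 200.0], [10.0, 1000.0]]
out = densify_space_filling(rows, n_new_points=6, seed=0)
```

Each returned tuple mirrors the bookkeeping the worker exposes (anchor, neighbour, interpolation parameter t). Normalisation only affects which neighbour is chosen; the blended values stay in the original units.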

When to use

Tagged: active_learning, adaptive_sampling, data_augmentation, doe, euclidean, nearest_neighbor, prediction_error, space_filling.

Inputs

Label	ID	Type	Default	Required	Description
Dataset	dataset	dataset	—	✓	Input dataset to densify; must contain every column listed in feature_columns and (optionally) prediction_error_column. Typical inputs: DOE results, ML training tables.
Feature Columns	feature_columns	text	—	✓	Comma-separated list of numeric column names that define the feature space. Distances and interpolation operate exclusively over these columns. Example: ‘EMOD, TENMAX, GCTEN’.
Prediction Error Column (optional)	prediction_error_column	text	—		Name of the column containing the prediction error or goodness score. When provided the worker runs in error-driven mode (more synthetic neighbours around the worst rows). When blank it runs in space-filling mode (more synthetic neighbours in sparse regions of the feature space). The polarity (whether HIGH or LOW values mean ‘worse’) is controlled by error_polarity — defaults to higher_is_worse for residuals / RMSE, switch to lower_is_worse for accuracy / R^2. Only the absolute value is read.
Mode	mode	select	auto		How synthetic points are generated. Pick one or more; each selected mode independently adds n_new_points rows. ‘error_driven’ clusters new points around the worst-fit rows (needs prediction_error_column). ‘space_filling’ spreads exactly n_new_points evenly across the empty regions of the feature space. ‘density’ keeps adding points until every region reaches a minimum neighbour count (driven by min_neighbors / neighbor_radius), up to the n_new_points budget. ‘auto’ picks error_driven when prediction_error_column is set and space_filling otherwise; it never selects density, which must be chosen explicitly.
Synthesis Method	method	select	neighbor_interpolation		How synthetic rows are generated. Today only ‘neighbor_interpolation’ is implemented; this input is forward-looking for future synthesis algorithms (perturbation, SMOTE, copula sampling, generative models).
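Number of New Points	n_new_points	text	—		How many synthetic rows to add per selected mode. Default 10. With a single space_filling or error_driven mode the output dataset has (input_rows + n_new_points) rows; in density mode n_new_points is an upper bound on the rows added.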
Top-k Nearest Neighbours	k_neighbors	text	—		When generating each synthetic row, the worker picks an anchor i and a partner j from the top-k Euclidean nearest neighbours of i. k=1 always blends with the single closest neighbour (most conservative); higher k spreads new points across more directions. Default 1.
Error Weighting	error_weighting	select	linear		Function applied to the badness score (|error| when error_polarity=higher_is_worse, or max|error| - |error| when lower_is_worse) to obtain anchor sampling weights. ‘none’ = uniform (ignores magnitude); ‘linear’ = weight ~ badness; ‘squared’ = weight ~ badness^2 (concentrates points on worst-fit rows); ‘exp’ = weight ~ exp(badness) - 1 (very aggressive). Ignored when prediction_error_column is empty.
Error Polarity	error_polarity	select	higher_is_worse		How to interpret the values in prediction_error_column. ‘higher_is_worse’ (default) treats larger absolute values as worse predictions — anchors are sampled toward those rows (classic active learning on residuals / RMSE). ‘lower_is_worse’ treats smaller values as worse — useful when the column actually carries a goodness metric like accuracy or R^2 where small numbers mean a poor fit; the worker inverts the score so anchors cluster around the LOW-value rows. Ignored when prediction_error_column is empty.
Interpolation Method	interpolation_method	select	midpoint		How a synthetic row is positioned along the segment from anchor i to neighbour j. ‘midpoint’ is deterministic (t=0.5). ‘random’ draws t uniformly from [0.2, 0.8] (avoids the endpoints).
Normalise Features	normalize_features	select	yes		Whether to min-max normalise each feature column to [0, 1] before computing Euclidean distances. Recommended (yes) when feature columns are in different physical units / scales. Has no effect on the interpolated output values, which are always emitted in the original (unnormalised) units.
Mark Synthetic Rows	mark_synthetic	select	yes		When ‘yes’ the augmented dataset gains four bookkeeping columns: is_synthetic (1 for new, 0 for original), anchor_row_id, neighbor_row_id, interp_t.
Random Seed (optional)	random_seed	text	—		Optional integer seed for the RNG that drives anchor / neighbour sampling. Set this to make the output reproducible across runs. Leave blank for a non-deterministic seed.
Minimum Neighbours per Region (density mode)	min_neighbors	text	—		Density mode only. The worker keeps adding points until every point in the dataset has at least this many other points within neighbor_radius; in other words, it fills sparse regions up to this density floor. n_new_points acts as the maximum number of synthetic points to add. Defaults to 5 when Mode is density and this is left blank. Ignored in space_filling and error_driven modes.
Neighbour Radius (density mode)	neighbor_radius	text	—		Density mode only. Radius used to count neighbours for the min_neighbors density floor, measured in the normalised [0,1] feature space (so 0.1 means 10% of each feature’s range). Leave 0 (default) to auto-derive it from the mean nearest-neighbour distance of the input data.
Constrain to Yield Line	yield_line	select	no		When ‘yes’, every synthetic row is snapped onto a stress-strain line: the stress column is set to (slope * strain + intercept). Use this for material data where yield stress and yield strain are physically related; synthetic points then sit on the yield line instead of scattering freely. Applies to both space_filling and error_driven modes; original rows are never changed. Requires yield_strain_column and yield_stress_column to be set.
Yield Strain Column	yield_strain_column	text	—		Name of the column that holds yield strain (the x of the yield line). Only used when Constrain to Yield Line is ‘yes’.
Yield Stress Column	yield_stress_column	text	—		Name of the column that holds yield stress (the y of the yield line). When Constrain to Yield Line is ‘yes’, this column’s value on every synthetic row is overwritten with slope * strain + intercept.
Yield Line Method	yield_line_method	select	fixed_modulus		How the yield line is determined. ‘fixed_modulus’ uses stress = yield_modulus * strain, a line through the origin (the classic linear-elastic line for metals). ‘linear_regression’ instead fits a least-squares line (slope and intercept) to the existing rows’ strain/stress data, which supports non-metallic materials and data whose trend does not pass through the origin. Only used when Constrain to Yield Line is ‘yes’.
Yield Modulus (E)	yield_modulus	text	210		The modulus E used when Yield Line Method is ‘fixed_modulus’; the yield line is stress = E * strain. Default 210 (approximately the elastic modulus of steel in GPa, so it assumes a stress column expressed in GPa). Set this to your material’s modulus, or switch Yield Line Method to linear_regression to derive the line from the data instead. Ignored when Yield Line Method is linear_regression.
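
The interaction between the Error Weighting and Error Polarity inputs in the table above can be made concrete with a small sketch. It follows the formulas given in the Error Weighting row; the function name `anchor_weights`, the normalisation to probabilities, and the uniform fallback for an all-zero badness score are assumptions for illustration, not documented worker behaviour.

```python
import math

def anchor_weights(errors, error_weighting="linear",
                   error_polarity="higher_is_worse"):
    """Hypothetical helper: anchor sampling probabilities from an error column."""
    abs_err = [abs(e) for e in errors]  # only the absolute value is read
    if error_polarity == "lower_is_worse":
        top = max(abs_err)
        badness = [top - a for a in abs_err]  # max|error| - |error|
    else:
        badness = abs_err                     # |error|
    transforms = {
        "none": lambda b: 1.0,            # uniform, ignores magnitude
        "linear": lambda b: b,
        "squared": lambda b: b * b,
        "exp": lambda b: math.expm1(b),   # exp(b) - 1
    }
    raw = [transforms[error_weighting](b) for b in badness]
    total = sum(raw)
    if total == 0:
        # ASSUMPTION: fall back to uniform sampling when every weight is zero.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]

residual_weights = anchor_weights([1.0, -3.0], "linear")
accuracy_weights = anchor_weights([0.9, 0.1], "linear", "lower_is_worse")
```

With residuals of 1.0 and -3.0 and linear weighting, the second row is drawn three times as often as the first. With a goodness metric and lower_is_worse, the row scoring 0.1 receives all of the weight, because the best row's badness is zero by construction.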

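The yield-line inputs listed above reduce to choosing a slope and intercept and then overwriting the stress value on synthetic rows. The sketch below shows both Yield Line Method options in plain Python; `fit_yield_line` and `snap_to_yield_line` are hypothetical names, rows are modelled as dictionaries, and the least-squares fit is the textbook closed form rather than whatever routine the worker uses internally.

```python
def fit_yield_line(strain, stress, method="fixed_modulus", yield_modulus=210.0):
    """Return (slope, intercept) of the line stress = slope * strain + intercept."""
    if method == "fixed_modulus":
        return yield_modulus, 0.0  # line through the origin
    # linear_regression: ordinary least squares on the existing rows.
    n = len(strain)
    mean_x = sum(strain) / n
    mean_y = sum(stress) / n
    sxx = sum((x - mean_x) ** 2 for x in strain)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(strain, stress))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def snap_to_yield_line(synthetic_rows, strain_col, stress_col, slope, intercept):
    """Overwrite the stress column on synthetic rows only; inputs are not mutated."""
    return [{**row, stress_col: slope * row[strain_col] + intercept}
            for row in synthetic_rows]

# Illustrative values only; column names "eps" and "sig" are made up.
slope, intercept = fit_yield_line([0.001, 0.002, 0.003], [1.2, 1.4, 1.6],
                                  method="linear_regression")
snapped = snap_to_yield_line([{"eps": 0.002, "sig": 9.9}], "eps", "sig",
                             slope, intercept)
```

The regression option is the one to reach for when the data's trend has a non-zero intercept, as in the illustrative values here; fixed_modulus forces the intercept to zero.
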
Outputs

Label	ID	Type	Description
Augmented Dataset	dataset	dataset	Original rows followed by the synthetic rows. When ‘Mark Synthetic Rows’ is yes, every row carries is_synthetic / anchor_row_id / neighbor_row_id / interp_t bookkeeping columns.
Synthetic Rows Only	new_rows	dataset	The newly-generated rows by themselves, convenient for piping to a simulation submission worker or to a Reporter.
Summary	summary	string	Plain-text status: mode chosen, row counts, mean nearest-neighbour distance before densification, error statistics when applicable.
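
When Mark Synthetic Rows is enabled, the bookkeeping columns make it possible to separate and sanity-check the synthetic rows downstream. The sketch below models the augmented dataset as a list of dictionaries. The helper names are hypothetical, the sample values are made up, and it assumes that anchor_row_id and neighbor_row_id are 0-based row positions, which this page does not confirm.

```python
def split_augmented(rows):
    """Separate original and synthetic rows using the is_synthetic flag."""
    original = [r for r in rows if r["is_synthetic"] == 0]
    synthetic = [r for r in rows if r["is_synthetic"] == 1]
    return original, synthetic

def blend_residual(rows, row, column):
    """Difference between a synthetic value and the blend its bookkeeping implies.

    ASSUMPTION: anchor_row_id / neighbor_row_id are 0-based positions in `rows`.
    """
    a = rows[row["anchor_row_id"]][column]
    b = rows[row["neighbor_row_id"]][column]
    return row[column] - (a + row["interp_t"] * (b - a))

# Illustrative data only, not worker output. Bookkeeping values on original
# rows are omitted because their fill value is not documented here.
augmented = [
    {"EMOD": 100.0, "is_synthetic": 0},
    {"EMOD": 200.0, "is_synthetic": 0},
    {"EMOD": 150.0, "is_synthetic": 1, "anchor_row_id": 0,
     "neighbor_row_id": 1, "interp_t": 0.5},
]
original, synthetic = split_augmented(augmented)
residuals = [blend_residual(augmented, r, "EMOD") for r in synthetic]
```

A residual near zero indicates the value is consistent with a linear blend of its recorded anchor and neighbour. Columns overwritten by the yield-line constraint would not be expected to pass this check.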

Disciplines

ai_ml.preprocessing
data.dataset.transform
design_exploration.doe

Runnable example

A runnable example is registered for this worker. Open the example workflow on the d3VIEW canvas: /api/workflow/example?id=dataset_generate_synthetic_data

Auto-generated from platform schema. Worker id: dataset_generate_synthetic_data. Schema hash: 7f6720131f09. Hand-curated docs in workerexamples/ override this page when present.