.. _auto_dataset_generate_synthetic_data:

*GENERATE SYNTHETIC DATA (DENSIFY DATASET)*
===========================================

Adds synthetic rows to a dataset by interpolating between existing rows in a user-selected feature space.

Anchor rows are chosen in one of two ways. When a prediction-error column is supplied, anchors are sampled with probability proportional to the (transformed) absolute error, so new points cluster around regions where a surrogate or ML model is least accurate (active learning / adaptive sampling). When no error column is supplied, anchors are sampled according to their mean nearest-neighbour distance, so new points fill gaps in the design space (space-filling).

Distances are Euclidean over the selected feature columns, optionally min-max normalised so that columns with different physical units contribute comparably. Each synthetic row is a linear blend of an anchor and one of its top-k nearest neighbours. Non-feature numeric columns are also linearly blended, while non-numeric columns are copied from the anchor.
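
The space-filling path described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the worker's implementation: the function names ``min_max_normalise``, ``top_k_neighbours`` and ``densify`` are hypothetical, and the anchor weight is assumed to be the mean distance from a row to its k nearest neighbours. Rows are plain lists of feature values, and at least two distinct rows are required.

.. code-block:: python

   import math
   import random


   def min_max_normalise(rows):
       """Scale each column to [0, 1]; constant columns map to 0.0."""
       cols = list(zip(*rows))
       lows = [min(c) for c in cols]
       spans = [(max(c) - min(c)) or 1.0 for c in cols]
       return [[(v - lo) / s for v, lo, s in zip(r, lows, spans)] for r in rows]


   def top_k_neighbours(points, i, k):
       """Indices of the k rows closest to row i (Euclidean), excluding i."""
       others = [j for j in range(len(points)) if j != i]
       others.sort(key=lambda j: math.dist(points[i], points[j]))
       return others[:k]


   def densify(rows, n_new, k=1, interpolation="midpoint", seed=None):
       """Return n_new blended rows, expressed in the original units."""
       rng = random.Random(seed)
       scaled = min_max_normalise(rows)  # used for distances only
       # Assumption: anchor weight = mean distance to the k nearest neighbours,
       # so isolated rows are picked more often.
       weights = []
       for i in range(len(rows)):
           nbrs = top_k_neighbours(scaled, i, k)
           weights.append(sum(math.dist(scaled[i], scaled[j]) for j in nbrs) / len(nbrs))
       new_rows = []
       for _ in range(n_new):
           i = rng.choices(range(len(rows)), weights=weights)[0]
           j = rng.choice(top_k_neighbours(scaled, i, k))
           t = 0.5 if interpolation == "midpoint" else rng.uniform(0.2, 0.8)
           # Blend the unnormalised values so the output keeps physical units.
           new_rows.append([(1 - t) * a + t * b for a, b in zip(rows[i], rows[j])])
       return new_rows


   rows = [[0.0, 0.0], [1.0, 10.0], [4.0, 40.0]]
   print(densify(rows, n_new=3, k=1, seed=0))

With ``k=1`` and midpoint interpolation every new row sits halfway between an anchor and its single closest neighbour. Passing the same ``seed`` twice yields the same rows, which mirrors the purpose of the ``random_seed`` input.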

When to use
-----------

Tagged: ``active_learning``, ``adaptive_sampling``, ``data_augmentation``, ``doe``, ``euclidean``, ``nearest_neighbor``, ``prediction_error``, ``space_filling``.

Inputs
------

.. list-table::
   :header-rows: 1
   :widths: 20 20 20 20 20 20

   * - Label
     - ID
     - Type
     - Default
     - Required
     - Description
   * - Dataset
     - dataset
     - dataset
     - —
     - ✓
     - Input dataset to densify; must contain every column listed in feature_columns and (optionally) prediction_error_column. Typical inputs: DOE results, ML training tables.
   * - Feature Columns
     - feature_columns
     - text
     - —
     - ✓
     - Comma-separated list of numeric column names that define the feature space. Distances and interpolation operate exclusively over these columns. Example: 'EMOD, TENMAX, GCTEN'.
   * - Prediction Error Column (optional)
     - prediction_error_column
     - text
     - —
     - 
     - Name of the column containing the prediction error or goodness score. When provided, the worker runs in error-driven mode (more synthetic neighbours around the worst rows). When blank, it runs in space-filling mode (more synthetic neighbours in sparse regions of the feature space). The polarity (whether high or low values mean 'worse') is controlled by error_polarity: the default, higher_is_worse, suits residuals / RMSE; switch to lower_is_worse for accuracy / R^2. Only the absolute value is read.
   * - Mode
     - mode
     - select
     - auto
     - 
     - Anchor-selection strategy. 'auto' picks error_driven when prediction_error_column is set, space_filling otherwise. Set explicitly to override.
   * - Synthesis Method
     - method
     - select
     - neighbor_interpolation
     - 
     - How synthetic rows are generated. Currently only 'neighbor_interpolation' is implemented; the input is reserved for future synthesis algorithms (perturbation, SMOTE, copula sampling, generative models).
   * - Number of New Points
     - n_new_points
     - text
     - 10
     - 
     - How many synthetic rows to add. Default 10. The output dataset has (input_rows + n_new_points) rows.
   * - Top-k Nearest Neighbours
     - k_neighbors
     - text
     - 1
     - 
     - When generating each synthetic row, the worker picks an anchor i and a partner j from the top-k Euclidean nearest neighbours of i. k=1 always blends with the single closest neighbour (most conservative); higher k spreads new points across more directions. Default 1.
   * - Error Weighting
     - error_weighting
     - select
     - linear
     - 
     - Function applied to the badness score (``|error|`` when error_polarity=higher_is_worse, or ``max|error| - |error|`` when lower_is_worse) to obtain anchor sampling weights. 'none' = uniform (ignores magnitude); 'linear' = weight ~ badness; 'squared' = weight ~ badness^2 (concentrates points on worst-fit rows); 'exp' = weight ~ exp(badness) - 1 (very aggressive). Ignored when prediction_error_column is empty.
   * - Error Polarity
     - error_polarity
     - select
     - higher_is_worse
     - 
     - How to interpret the values in prediction_error_column. 'higher_is_worse' (default) treats larger absolute values as worse predictions, so anchors are sampled toward those rows (classic active learning on residuals / RMSE). 'lower_is_worse' treats smaller values as worse, which is useful when the column carries a goodness metric such as accuracy or R^2, where small numbers mean a poor fit; the worker inverts the score so anchors cluster around the low-value rows. Ignored when prediction_error_column is empty.
   * - Interpolation Method
     - interpolation_method
     - select
     - midpoint
     - 
     - How a synthetic row is positioned along the segment from anchor i to neighbour j. 'midpoint' is deterministic (t=0.5). 'random' draws t uniformly from [0.2, 0.8] (avoids the endpoints).
   * - Normalise Features
     - normalize_features
     - select
     - yes
     - 
     - Whether to min-max normalise each feature column to [0, 1] before computing Euclidean distances. Recommended (yes) when feature columns are in different physical units / scales. Has no effect on the interpolated output values, which are always emitted in the original (unnormalised) units.
   * - Mark Synthetic Rows
     - mark_synthetic
     - select
     - yes
     - 
     - When 'yes' the augmented dataset gains four bookkeeping columns: is_synthetic (1 for new, 0 for original), anchor_row_id, neighbor_row_id, interp_t.
   * - Random Seed (optional)
     - random_seed
     - text
     - —
     - 
     - Optional integer seed for the RNG that drives anchor / neighbour sampling. Set this to make the output reproducible across runs. Leave blank for a non-deterministic seed.
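
The interaction between ``error_polarity`` and ``error_weighting`` can be illustrated with a short Python sketch. It follows the formulas given in the table, but the function name ``anchor_weights``, the normalisation to probabilities, and the uniform fallback when every badness value is zero are assumptions made for illustration only.

.. code-block:: python

   import math


   def anchor_weights(errors, weighting="linear", polarity="higher_is_worse"):
       """Turn an error column into normalised anchor-sampling probabilities."""
       abs_err = [abs(e) for e in errors]  # only the absolute value is read
       if polarity == "lower_is_worse":
           worst = max(abs_err)
           badness = [worst - a for a in abs_err]  # invert: low score = bad fit
       else:
           badness = abs_err
       transforms = {
           "none": lambda b: 1.0,
           "linear": lambda b: b,
           "squared": lambda b: b * b,
           "exp": lambda b: math.exp(b) - 1.0,
       }
       raw = [transforms[weighting](b) for b in badness]
       total = sum(raw)
       if total == 0:
           # Assumption: fall back to uniform sampling when all weights vanish.
           return [1.0 / len(raw)] * len(raw)
       return [w / total for w in raw]


   print(anchor_weights([1, -3], "linear"))
   print(anchor_weights([0.9, 0.5], "linear", "lower_is_worse"))

In the first call the row with error -3 receives three times the weight of the row with error 1. In the second call the column is treated as a goodness metric, so the row scoring 0.5 receives all of the weight and the best-scoring row receives none.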

Outputs
-------

.. list-table::
   :header-rows: 1
   :widths: 20 20 20 20

   * - Label
     - ID
     - Type
     - Description
   * - Augmented Dataset
     - dataset
     - dataset
     - Original rows followed by the synthetic rows. When 'Mark Synthetic Rows' is yes, every row carries is_synthetic / anchor_row_id / neighbor_row_id / interp_t bookkeeping columns.
   * - Synthetic Rows Only
     - new_rows
     - dataset
     - The newly generated rows on their own, convenient for piping to a simulation submission worker or to a Reporter.
   * - Summary
     - summary
     - string
     - Plain-text status: mode chosen, row counts, mean nearest-neighbour distance before densification, error statistics when applicable.
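
When 'Mark Synthetic Rows' is yes, downstream code can separate original and synthetic rows using the ``is_synthetic`` column. The sketch below assumes the augmented dataset has been exported to a list of dictionaries; that export format, the helper name ``split_augmented`` and the sample values are hypothetical.

.. code-block:: python

   def split_augmented(rows):
       """Split augmented rows into (original, synthetic) using is_synthetic."""
       original = [r for r in rows if int(r["is_synthetic"]) == 0]
       synthetic = [r for r in rows if int(r["is_synthetic"]) == 1]
       return original, synthetic


   # Hypothetical three-row augmented table; values are illustrative only.
   augmented = [
       {"EMOD": 200.0, "is_synthetic": 0},
       {"EMOD": 210.0, "is_synthetic": 0},
       {"EMOD": 205.0, "is_synthetic": 1},
   ]
   original, synthetic = split_augmented(augmented)
   print(len(original), len(synthetic))

The synthetic half of this split should match the 'Synthetic Rows Only' output, so the filter is mainly useful when only the augmented dataset has been kept.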

Disciplines
-----------

- ai_ml.preprocessing
- data.dataset.transform
- design_exploration.doe

Runnable example
----------------

A runnable example is registered for this worker. Open the example workflow on the d3VIEW canvas: `/api/workflow/example?id=dataset_generate_synthetic_data <https://www.d3view.com/api/workflow/example?id=dataset_generate_synthetic_data>`_

.. raw:: html

   <hr style="margin-top:2em">
   <p style="font-size:11px;color:#888">
   Auto-generated from <code>platform</code> schema. Worker id: <code>dataset_generate_synthetic_data</code>. Schema hash: <code>d93b5734e8e4</code>. Hand-curated docs in <code>workerexamples/</code> override this page when present.
   </p>