DATASET MERGE BASED ON EUCLIDEAN DISTANCE

Merges or removes rows in a dataset that are in close proximity to one another based on Euclidean distance computed over a user-specified set of columns. Rows whose pairwise distance falls within a scaled threshold are either averaged together (merged) or dropped, reducing near-duplicate design points. Use this worker to deduplicate DOE tables or simulation result datasets before downstream analysis.
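The behaviour described above can be illustrated with a minimal sketch. It is not the worker's actual implementation: the function names, the greedy grouping strategy, and the use of a single absolute `cutoff` distance are all assumptions made for illustration. The 'merge' and 'yes' treatment values are the ones listed under Inputs.

```python
import math

def euclidean(a, b, columns):
    """Euclidean distance between two row dicts over the chosen columns."""
    return math.sqrt(sum((a[c] - b[c]) ** 2 for c in columns))

def reduce_proximate_rows(rows, columns, cutoff, treatment="merge"):
    """Hypothetical helper: greedy single pass over the rows.

    Each row joins the first existing group whose leading row lies within
    `cutoff`; otherwise it starts a new group.
    treatment="merge" averages each group into one row.
    Any other value (for example "yes") keeps only the first row per group.
    """
    groups = []
    for row in rows:
        for group in groups:
            if euclidean(group[0], row, columns) <= cutoff:
                group.append(row)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append([row])

    if treatment == "merge":
        # Average every column across the group (assumes numeric values).
        return [
            {key: sum(r[key] for r in group) / len(group) for key in group[0]}
            for group in groups
        ]
    # Removal: drop the later near-duplicates, keep the first row unchanged.
    return [group[0] for group in groups]

rows = [
    {"x": 0.0, "y": 0.0, "response": 10.0},
    {"x": 0.004, "y": 0.0, "response": 12.0},
    {"x": 1.0, "y": 1.0, "response": 20.0},
]
merged = reduce_proximate_rows(rows, ["x", "y"], cutoff=0.01)
removed = reduce_proximate_rows(rows, ["x", "y"], cutoff=0.01, treatment="yes")
```

In this sketch the first two rows are 0.004 apart, so they form one group: merging averages them into a single row, while removal keeps the first and drops the second. The third row is far from both and is left unchanged in either case.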

When to use

Classification: process.

Tagged: dataset, deduplication, doe_cleanup, euclidean_distance, merge, proximity, row_reduction.

Inputs

Label | ID | Type | Default | Required | Description
Dataset | dataset_1 | dataset | | | Input tabular dataset whose rows will be evaluated for proximity. Connect any dataset output; leave unconnected only if the downstream transformation chain supplies data directly.
Columns to Use for Proximity | columns | scalar | | | One or more column names (comma-separated or multi-select) from dataset_1 that define the feature space over which Euclidean distance is calculated. Include every dimension relevant to the proximity judgement.
Threshold | threshold | scalar | 0.01 | | Scaling ratio applied to each row’s Euclidean distance to derive the merge threshold (dimensionless, default 0.01). Smaller values require rows to be closer together before they are merged or removed.
Proximity Treatment | treatment | string | merge | | Action taken on proximate rows: ‘merge’ (default) replaces each pair of close rows with a single row of their averaged values; ‘yes’ removes the duplicate rows entirely.
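The following sketch shows one possible reading of the Columns and Threshold inputs. Both helper names are hypothetical. The rule that the cutoff equals the threshold ratio multiplied by the row's own Euclidean norm over the selected columns is an assumption, since this page does not state the exact formula.

```python
import math

def parse_columns(spec):
    """Split a comma-separated column specification into clean names."""
    return [name.strip() for name in spec.split(",") if name.strip()]

def row_cutoff(row, columns, threshold=0.01):
    """ASSUMPTION: cutoff = threshold * Euclidean norm of the row itself,
    computed only over the selected proximity columns."""
    norm = math.sqrt(sum(row[c] ** 2 for c in columns))
    return threshold * norm

columns = parse_columns("x, y")
cutoff = row_cutoff({"x": 3.0, "y": 4.0, "response": 7.0}, columns)
```

Under this assumption a row at (3, 4) has norm 5, so the default ratio of 0.01 gives a cutoff distance of about 0.05. The `response` column does not affect the cutoff because it is not among the proximity columns.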

Outputs

Label | ID | Type | Description
Merged Dataset | merged_dataset | dataset | Cleaned tabular dataset with near-duplicate rows either averaged into a single representative row or removed, depending on the chosen proximity treatment.

Disciplines

  • data.dataset.transform
  • design_exploration.doe

Runnable example

A runnable example is registered for this worker. Open the example workflow on the d3VIEW canvas: /api/workflow/example?id=dataset_merge_based_on_euclidean_distance
Auto-generated from transformation schema. Worker id: dataset_merge_based_on_euclidean_distance. Schema hash: 6b2c9f644de3. Hand-curated docs in workerexamples/ override this page when present.