DATASET MERGE BASED ON EUCLIDEAN DISTANCE
Merges or removes rows in a dataset that are in close proximity to one another based on Euclidean distance computed over a user-specified set of columns. Rows whose pairwise distance falls within a scaled threshold are either averaged together (merged) or dropped, reducing near-duplicate design points. Use this worker to deduplicate DOE tables or simulation result datasets before downstream analysis.
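The behaviour described above can be sketched in plain Python. This is a minimal illustration, not the worker's actual implementation. It makes several assumptions that the page does not confirm: the function name `reduce_proximate_rows` is hypothetical, the cutoff is taken as `threshold` times the kept row's distance from the origin over the selected columns, rows are processed greedily in order, merging averages every numeric column, and any treatment other than `merge` keeps the first row of a close group and drops the later ones.

```python
import math


def euclidean(a, b, columns):
    """Euclidean distance between two row dicts over the chosen columns."""
    return math.sqrt(sum((a[c] - b[c]) ** 2 for c in columns))


def reduce_proximate_rows(rows, columns, threshold=0.01, treatment="merge"):
    """Greedy single-pass reduction of near-duplicate rows (hypothetical helper).

    ASSUMPTION: the cutoff for a kept row is threshold * (that row's distance
    from the origin over `columns`). The real worker may scale differently.
    """
    kept = []    # representative rows retained so far
    counts = []  # number of original rows folded into each representative
    for row in rows:
        match = None
        for i, rep in enumerate(kept):
            cutoff = threshold * math.sqrt(sum(rep[c] ** 2 for c in columns))
            if euclidean(row, rep, columns) <= cutoff:
                match = i
                break
        if match is None:
            kept.append(dict(row))  # copy so the input rows are not mutated
            counts.append(1)
        elif treatment == "merge":
            # Update the running mean of every column in the representative.
            n = counts[match]
            rep = kept[match]
            for key, value in row.items():
                rep[key] = (rep[key] * n + value) / (n + 1)
            counts[match] = n + 1
        # Any other treatment value: the proximate row is simply dropped.
    return kept


# Illustrative data: the first two rows are near-duplicates in (x, y).
rows = [
    {"x": 1.0, "y": 1.0, "response": 10.0},
    {"x": 1.001, "y": 1.0, "response": 12.0},
    {"x": 5.0, "y": 5.0, "response": 3.0},
]
merged = reduce_proximate_rows(rows, ["x", "y"], threshold=0.01, treatment="merge")
removed = reduce_proximate_rows(rows, ["x", "y"], threshold=0.01, treatment="yes")
```

Under these assumptions, both calls return two rows. With `merge`, the first output row carries the averaged values of the close pair. With `yes`, it carries the first row's values unchanged.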
When to use
Classification: process.
Tagged: dataset, deduplication, doe_cleanup, euclidean_distance, merge, proximity, row_reduction.
Inputs
| Label | ID | Type | Default | Required | Description |
|---|---|---|---|---|---|
| Dataset | dataset_1 | dataset | — | | Input tabular dataset whose rows will be evaluated for proximity; connect any dataset output, and leave unconnected only if the downstream transformation chain supplies data directly. |
| Columns to Use for Proximity | columns | scalar | — | ✓ | One or more column names (comma-separated or multi-select) from dataset_1 that define the feature space over which Euclidean distance is calculated; must include all dimensions relevant to proximity judgement. |
| Threshold | threshold | scalar | 0.01 | | Scaling ratio applied to each row’s Euclidean distance to derive the merge threshold (dimensionless, default 0.01); smaller values require rows to be very close before being merged or removed. |
| Proximity Treatment | treatment | string | merge | | Action taken on proximate rows: ‘merge’ (default) replaces the close pair with a single row of averaged values, while ‘yes’ removes the duplicate rows entirely. |
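Because distance is computed only over the columns listed in `columns`, the choice of columns decides which rows count as close. The short sketch below illustrates this with two hypothetical rows. The column names (`thickness`, `velocity`, `mass`) and the helper `distance_over` are invented for illustration, and the comma-splitting step is an assumption about how a comma-separated value might be parsed.

```python
import math


def distance_over(a, b, columns):
    """Euclidean distance between two row dicts, restricted to `columns`."""
    return math.sqrt(sum((a[c] - b[c]) ** 2 for c in columns))


# Hypothetical rows that differ only in the "mass" column.
row_a = {"thickness": 2.0, "velocity": 15.0, "mass": 40.0}
row_b = {"thickness": 2.0, "velocity": 15.0, "mass": 55.0}

# A comma-separated column list, split into individual names.
selected = [name.strip() for name in "thickness, velocity".split(",")]

d_selected = distance_over(row_a, row_b, selected)          # mass ignored
d_all = distance_over(row_a, row_b, selected + ["mass"])    # mass included
```

With only `thickness` and `velocity` selected, the two rows are at distance zero and would be treated as duplicates at any threshold. Adding `mass` separates them, which is why the column list should include every dimension relevant to the proximity judgement.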
Outputs
| Label | ID | Type | Description |
|---|---|---|---|
| Merged Dataset | merged_dataset | dataset | Cleaned tabular dataset with near-duplicate rows either averaged into a single representative row or removed, depending on the chosen proximity treatment. |
Disciplines
- data.dataset.transform
- design_exploration.doe
Runnable example
A runnable example is registered for this worker. Open the example workflow on the d3VIEW canvas: /api/workflow/example?id=dataset_merge_based_on_euclidean_distance
Auto-generated from transformation schema. Worker id: dataset_merge_based_on_euclidean_distance. Schema hash: 6b2c9f644de3. Hand-curated docs in workerexamples/ override this page when present.