PARSES SEVERAL CSV FILES BASED ON EXTENSION AND COLLECTS THE COLUMNS TO CREATE A DATASET
Scans multiple CSV files (filtered by file extension) and merges selected columns across all matching files into a single dataset. Use this worker when you need to consolidate specific columns from a batch of CSV files into one unified tabular dataset for downstream analysis or modeling.
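As a rough illustration of the behaviour described above, the following Python sketch scans a directory, keeps the files whose extension matches, and stacks the selected columns into one list of rows. It is not the worker's implementation: the function name `collect_columns`, its parameters, the choice to select columns by header name, and the row-stacking merge strategy are all assumptions made for this example.

```python
import csv
from pathlib import Path


def collect_columns(directory, extensions, column_names, delimiter=","):
    """Hypothetical helper: gather selected columns from every matching file."""
    # Normalise "csv" or ".CSV" to ".csv"; an empty set means "accept all files".
    wanted = {"." + e.strip().lstrip(".").lower() for e in extensions if e.strip()}
    rows = []
    # Sort the paths so the order of the output rows is deterministic.
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        if wanted and path.suffix.lower() not in wanted:
            continue
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle, delimiter=delimiter)
            for record in reader:
                # Keep only the requested columns; missing ones become None.
                rows.append({name: record.get(name) for name in column_names})
    return rows
```

Calling `collect_columns("data", ["csv"], ["id", "score"])` would, under these assumptions, return one dictionary per data row across every `.csv` file in the hypothetical `data` directory.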
When to use
Classification: process.
Tagged: batch, column-extraction, csv, dataset-builder, ingest, multi-file, parse.
Inputs
| Label | ID | Type | Default | Required | Description |
|---|---|---|---|---|---|
| Files Extensions Separated By Comma | files_extensions_separatedby_comma | string | — |  | Comma-separated list of file extensions to match when scanning for input files (e.g. ‘csv,txt’); leave blank to process all files regardless of extension. |
| Line Delimiter | line_delimiter | string | , |  | Character used to delimit fields within each CSV row (e.g. ‘,’ for standard CSV, ‘\t’ for TSV); defaults to comma if left empty. |
| Header Names | header_names | string | — |  | Comma-separated list of custom column header names to assign to the extracted columns; leave blank to use the header row already present in each file. |
| Column Ids | column_ids | string | — |  | Comma-separated list of column indices or names to extract from each CSV file; leave blank to collect all columns. |
| Starting Row Id | starting_row_id | integer | — |  | Zero-based or one-based row index at which to begin reading data from each file; leave blank to start from the first data row. |
| Ending Row Id | ending_row_id | integer | — |  | Row index at which to stop reading data from each file (inclusive); leave blank to read all rows through the end of each file. |
| Replace Values | replace_values | string | — |  | Mapping of values to find and replace in the collected data, expressed as a delimited key-value string (e.g. ‘N/A:0,null:0’); leave blank to perform no substitutions. |
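
Several of the inputs above are strings that encode lists or mappings. The sketch below shows one way such strings could be parsed and a replacement mapping applied. The helper names (`parse_list`, `parse_replacements`, `apply_replacements`) are hypothetical, and the comma and colon separators are inferred only from the ‘csv,txt’ and ‘N/A:0,null:0’ examples in the table; the worker's actual parsing rules may differ.

```python
def parse_list(value):
    """Split a comma-separated input string into trimmed, non-empty items."""
    return [item.strip() for item in (value or "").split(",") if item.strip()]


def parse_replacements(value):
    """Turn a string such as 'N/A:0,null:0' into {'N/A': '0', 'null': '0'}."""
    mapping = {}
    for pair in parse_list(value):
        # Split on the first colon only, so the replacement may contain ':'.
        old, sep, new = pair.partition(":")
        if sep:  # skip malformed entries that have no colon (an assumption)
            mapping[old.strip()] = new.strip()
    return mapping


def apply_replacements(rows, mapping):
    """Return new rows in which any cell equal to a mapping key is substituted."""
    return [[mapping.get(cell, cell) for cell in row] for row in rows]


extensions = parse_list("csv, txt")
replacements = parse_replacements("N/A:0,null:0")
cleaned = apply_replacements([["1", "N/A"], ["null", "4"]], replacements)
```

Here a blank or missing input yields an empty list or mapping, which matches the "leave blank" behaviour described in the table only to the extent that an empty mapping performs no substitutions.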
Outputs
| Label | ID | Type | Description |
|---|---|---|---|
| csv_collect_columns_to_dataset_output_1 | csv_collect_columns_to_dataset_output_1 | dataset | Consolidated dataset containing the extracted and merged columns from all matched CSV files, ready for downstream transformation, analysis, or model ingestion. |
Disciplines
- data.dataset.ingest
- data.dataset.transform
- data.io.csv
Auto-generated from transformation schema. Worker id: csv_collect_columns_to_dataset. Schema hash: b7f6c14d7fac. Hand-curated docs in workerexamples/ override this page when present.