infer.steps array of an ingestion request. Steps execute in order: the output of each step becomes the input for the next.
Step types at a glance
| Step | type value | Purpose |
|---|---|---|
| Table extraction | table | Extract structured columns from unstructured text |
| Rows | rows | Expand a JSON array stored in a column into rows |
| Normalize | normalize | Map extracted column names to a target schema |
| Validate | validate | Check data quality and flag invalid rows |
| ReverseGeo | reverseGeo | Add latitude/longitude from location names |
How the pipeline works
An inference can include multiple steps, which are combined automatically. Steps execute sequentially in the order you list them. A typical pipeline extracts a table, normalizes the columns to a target schema, then validates the result.
Step reference
Table extraction
The Table step extracts a table from any text input, whether it's free-form text such as a restaurant review, a flight log report, or a financial instrument notice. It works by extracting data into new columns and appending them to the rows of the existing table. Used on its own, this step may produce somewhat arbitrary column names that often require normalization afterward.
| Field | Required | Description |
|---|---|---|
| type | Yes | Must be "table" |
| deriveFrom | Yes | Name of the column containing unstructured text |
| helperPrompt | No | Instructions to the LLM about what to extract and how |
Extracted column names can vary from run to run (e.g., phone_number vs telephone). Follow table extraction with a normalize step to consolidate these into a consistent schema.
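As a sketch, a table extraction step might be configured like this (the column name and prompt text are illustrative, not part of any fixed schema):

```json
{
  "type": "table",
  "deriveFrom": "review_text",
  "helperPrompt": "Each row is a restaurant review. Extract the restaurant name, cuisine, and star rating."
}
```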
Rows
If a table is stored in JSON format within a column, you can directly transform it into a structured table. The JSON should consist of an array of objects, with each key in the objects mapped to a new column.
| Field | Required | Description |
|---|---|---|
| type | Yes | Must be "rows" |
| deriveFrom | Yes | Name of the column containing a JSON array |
For example, if a column contains [{"name": "Alice", "age": 34}, {"name": "Bob", "age": 28}], the Rows step produces two rows with name and age columns.
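A Rows step expanding such a column could look like the following sketch (the column name people is an assumption):

```json
{
  "type": "rows",
  "deriveFrom": "people"
}
```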
Normalize
Use the Normalize step to consolidate the schema or column space. In this step, specify the columns you want to include in the final table that will be indexed into DataLinks.
| Field | Required | Description |
|---|---|---|
| type | Yes | Must be "normalize" (or "normalise") |
| mode | Yes | One of "all-in-one", "field-by-field", or "embeddings" |
| targetCols | Yes | Object mapping desired column names to descriptions |
| helperPrompt | No | Domain context to help the LLM interpret column names |
Choosing a mode
The normalize step supports three different modes:
| Mode | How it works | Best for |
|---|---|---|
| all-in-one | Uses a single prompt to normalize all columns at once | Good default for most cases. Fast and effective when columns are reasonably self-explanatory. |
| field-by-field | Normalizes each field individually | Complex schemas where columns may be ambiguous. More accurate but slower. |
| embeddings | Uses embeddings to match column names instead of LLM prompts | High-volume ingestion with predictable, consistent source schemas. Fastest option. |
The descriptions in targetCols matter: they are the primary signal the LLM uses to decide which extracted column maps to which target. Be specific: "Email address" is better than "email", and "Annual salary in USD" is better than "salary".
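Putting the fields together, a normalize step might look like this sketch (column names and descriptions are invented for illustration):

```json
{
  "type": "normalize",
  "mode": "all-in-one",
  "targetCols": {
    "email": "Email address of the contact",
    "salary_usd": "Annual salary in USD"
  },
  "helperPrompt": "The source data is HR records; salaries may appear with currency symbols."
}
```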
Validate
The Validate step ensures data quality by validating the content of specified columns. It supports three validation modes:
| Field | Required | Description |
|---|---|---|
| type | Yes | Must be "validate" |
| mode | Yes | One of "regex", "rows", or "fields" |
| columns | Yes | Array of column names to validate |
Validation modes
- regex: Validates columns using regular expressions
- rows: Validates entire rows based on the specified columns
- fields: Validates individual fields in the specified columns
The output includes a __valid field indicating whether the validation passed.
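A validate step checking individual fields could be configured as in this sketch (the column names are illustrative):

```json
{
  "type": "validate",
  "mode": "fields",
  "columns": ["email", "phone"]
}
```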
ReverseGeo
The ReverseGeo step adds geographical coordinates (latitude and longitude) based on location names in a specified column.
Note: ReverseGeo is an optional feature not enabled by default. Contact your account representative to add it to your workspace.
| Field | Required | Description |
|---|---|---|
| type | Yes | Must be "reverseGeo" |
| deriveFrom | Yes | Name of the column containing location names |
The step adds a new column named {locationColumnName}_latlong containing the coordinates.
Example output:
| city | city_latlong |
|---|---|
| New York | 40.7128,-74.0060 |
| London | 51.5074,-0.1278 |
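A reverseGeo step producing a city_latlong column like the one above could look like this:

```json
{
  "type": "reverseGeo",
  "deriveFrom": "city"
}
```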
Writing effective helper prompts
The helperPrompt field is available on both the table extraction and normalize steps. It is the most impactful parameter for inference quality. A good prompt gives the LLM the context it needs to make accurate decisions.
Be specific about the schema you expect. Name the columns and describe their format:
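For example, a table step with a schema-specific prompt might look like this sketch (the column names and formats are invented):

```json
{
  "type": "table",
  "deriveFrom": "listing_text",
  "helperPrompt": "Each entry describes a job posting. Extract: title (string), company (string), salary_usd (number, annual), remote (true or false)."
}
```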
Choosing an LLM
Steps that use LLM inference (table extraction and normalize) accept optional model and provider fields to specify which model runs the step.
| Field | Required | Description |
|---|---|---|
| model | No | The model identifier |
| provider | No | The provider name (e.g., "openai", "ollama") |
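For instance, a normalize step pinned to a specific model could look like this sketch (the model identifier shown is illustrative, not a guaranteed-available model):

```json
{
  "type": "normalize",
  "mode": "all-in-one",
  "targetCols": { "name": "Full name of the person" },
  "model": "gpt-4o-mini",
  "provider": "openai"
}
```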
Complete example
This example takes unstructured directory entries, extracts structured fields, normalizes the schema, and validates the results.
Saving inference definitions
When you create a dataset, you can save a data description and field definition that DataLinks uses during future ingestion calls. This avoids repeating the same configuration on every request. Set the definition at creation time via the inferDefinition field on Create new dataset, or update it later with Update infer definition.
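For reference, an infer.steps configuration for the directory-entries example — whether sent per request or saved as a definition — might look like this sketch (all column names, prompt text, and target fields are invented):

```json
{
  "infer": {
    "steps": [
      {
        "type": "table",
        "deriveFrom": "entry_text",
        "helperPrompt": "Each entry is a directory listing. Extract the person's name, phone number, and city."
      },
      {
        "type": "normalize",
        "mode": "all-in-one",
        "targetCols": {
          "name": "Full name of the person",
          "phone": "Telephone number",
          "city": "City of residence"
        }
      },
      {
        "type": "validate",
        "mode": "fields",
        "columns": ["phone"]
      }
    ]
  }
}
```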