When you ingest data into DataLinks, you can include an inference pipeline that transforms your data before it is stored. The pipeline is a sequence of steps that extract structure from unstructured text, normalize column names to a target schema, validate data quality, and enrich records with geographic coordinates. Inference steps are specified in the infer.steps array of an ingestion request. Steps execute in order: the output of each step becomes the input for the next.

Step types at a glance

Step              type value   Purpose
Table extraction  table        Extract structured columns from unstructured text
Rows              rows         Expand a JSON array stored in a column into rows
Normalize         normalize    Map extracted column names to a target schema
Validate          validate     Check data quality and flag invalid rows
ReverseGeo        reverseGeo   Add latitude/longitude from location names

How the pipeline works

An inference pipeline can contain multiple steps, which execute sequentially in the order you list them. A typical pipeline follows this pattern:
Raw data
  → Table extraction (unstructured text → columns)
    → Normalize (inconsistent column names → target schema)
      → Validate (flag rows that fail quality checks)
        → Stored in DataLinks
You can use any combination of steps. A single step is fine. So is skipping straight to normalize if your data is already tabular but has inconsistent column names. The full request structure looks like this:
import requests

response = requests.post(
    "https://api.prod.datalinks.com/api/v1/ingest/namespace/datasetName",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={
        "data": [
            {"col1": 123, "col2": "foo", "colWithATable": "direction,size;up,large"},
            {"col1": 444, "col2": "bar"}
        ],
        "infer": {
            "steps": [
                {
                    "type": "table",
                    "deriveFrom": "colWithATable",
                    "helperPrompt": "This text contains a table comma separated, but instead of line breaks we are using a ';'."
                }
            ]
        }
    }
)
In this example, we request a table extracted from the colWithATable column. The step extracts every column it can find in the provided text. Use helperPrompt to guide the model toward more accurate extraction.

Step reference

Table extraction

The Table step extracts a table from any text input, whether free-form text such as a restaurant review, a flight log report, or a financial instrument notice. Used on its own, this step may produce columns with somewhat arbitrary names that often require normalization afterward. The step extracts data into new columns and appends them to the existing rows of the table.
{
    "type": "table",
    "deriveFrom": "source_column",
    "helperPrompt": "Optional prompt to guide extraction."
}
Field         Required  Description
type          Yes       Must be "table"
deriveFrom    Yes       Name of the column containing unstructured text
helperPrompt  No        Instructions to the LLM about what to extract and how
Pro tip: Calling an extraction with just the table step is a great way to see what kind of structured data can be generated from your unstructured data. When processing diverse inputs, the LLM will invent column names based on what it finds. Different rows may produce different names for the same concept (e.g., phone_number vs telephone). Follow table extraction with a normalize step to consolidate these into a consistent schema.

Rows

If a table is stored in JSON format within a column, you can directly transform it into a structured table. The JSON should consist of an array of objects, with each key in the objects mapped to a new column.
{
    "type": "rows",
    "deriveFrom": "json_column"
}
Field       Required  Description
type        Yes       Must be "rows"
deriveFrom  Yes       Name of the column containing a JSON array
For example, if a column contains [{"name": "Alice", "age": 34}, {"name": "Bob", "age": 28}], the Rows step produces two rows with name and age columns.
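To make the behavior concrete, here is a small local sketch of what the Rows step does conceptually. This is an illustration, not the service implementation, and it assumes (based on the example above) that non-array columns are carried over to each expanded row:

```python
import json

def expand_rows(row, derive_from):
    """Sketch of the rows step: expand a JSON array column into one
    output row per array element, keeping the remaining columns."""
    items = json.loads(row[derive_from])
    base = {k: v for k, v in row.items() if k != derive_from}
    return [{**base, **item} for item in items]

row = {
    "source": "directory",
    "people": '[{"name": "Alice", "age": 34}, {"name": "Bob", "age": 28}]',
}
for out in expand_rows(row, "people"):
    print(out)
```

Running this yields one row per object in the array, each with name and age columns alongside the original source column.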

Normalize

Use the Normalize step to consolidate the schema or column space. In this step, specify the columns you want to include in the final table that will be indexed into DataLinks.
{
    "type": "normalize",
    "mode": "all-in-one",
    "targetCols": {
        "full_name": "The person's full name, without titles like Dr. or Mr.",
        "age": "Age as a number",
        "company": "Company or organization name",
        "email": "Email address"
    },
    "helperPrompt": "This data comes from professional directory listings."
}
Field         Required  Description
type          Yes       Must be "normalize" (or "normalise")
mode          Yes       One of "all-in-one", "field-by-field", or "embeddings"
targetCols    Yes       Object mapping desired column names to descriptions
helperPrompt  No        Domain context to help the LLM interpret column names

Choosing a mode

The normalize step supports three different modes:
Mode            How it works                                 Best for
all-in-one      Normalizes all columns with a single prompt  Good default for most cases; fast and effective when columns are reasonably self-explanatory
field-by-field  Normalizes each field individually           Complex schemas where columns may be ambiguous; more accurate but slower
embeddings      Matches column names using embeddings        High-volume ingestion with predictable, consistent source schemas; fastest option
The descriptions you provide in targetCols matter. They are the primary signal the LLM uses to decide which extracted column maps to which target. Be specific: "Email address" is better than "email", and "Annual salary in USD" is better than "salary".
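The idea behind the embeddings mode can be sketched with a toy stand-in: map each source column to the target whose name or description it most resembles. The real service uses vector embeddings; the string-similarity scoring below is only an assumption-laden illustration of the matching concept, and the column names are invented for the example:

```python
from difflib import SequenceMatcher

def match_columns(source_cols, target_cols):
    """Toy stand-in for embeddings-based matching: map each source
    column to the target whose name or description is most similar."""
    def score(src, tgt):
        s = src.lower()
        return max(
            SequenceMatcher(None, s, tgt.lower()).ratio(),
            SequenceMatcher(None, s, target_cols[tgt].lower()).ratio(),
        )
    return {src: max(target_cols, key=lambda tgt: score(src, tgt))
            for src in source_cols}

targets = {"full_name": "The person's full name", "email": "Email address"}
print(match_columns(["name", "e-mail_address"], targets))
```

Note how the description "Email address" is what pulls e-mail_address toward the email target; this is why specific targetCols descriptions matter regardless of mode.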

Validate

The Validate step ensures data quality by validating the content of specified columns. It supports three validation modes:
{
    "type": "validate",
    "mode": "regex|rows|fields",
    "columns": ["column1", "column2"]
}
Field    Required  Description
type     Yes       Must be "validate"
mode     Yes       One of "regex", "rows", or "fields"
columns  Yes       Array of column names to validate

Validation modes

  1. regex: Validates columns using regular expressions
  2. rows: Validates entire rows based on the specified columns
  3. fields: Validates individual fields in the specified columns
Validated rows will include a __valid field indicating whether the validation passed.
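A local sketch of the output shape: each row gains a __valid flag based on the listed columns. The patterns below are our own assumptions for illustration; the actual checks the service applies per mode are not specified here.

```python
import re

# Hypothetical per-column patterns, purely for illustration.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+\d[\d\s().-]{6,}$"),
}

def validate_rows(rows, columns):
    """Sketch of regex-mode validation: append a __valid flag per row."""
    out = []
    for row in rows:
        ok = all(PATTERNS[c].match(str(row.get(c, ""))) for c in columns)
        out.append({**row, "__valid": ok})
    return out

rows = [
    {"email": "alice@northwind.com", "phone": "+1 555 012 3456"},
    {"email": "not-an-email", "phone": "+1 555 012 3456"},
]
print(validate_rows(rows, ["email", "phone"]))
```

Downstream, you can filter or route rows on __valid rather than re-checking the columns yourself.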

ReverseGeo

The ReverseGeo step adds geographical coordinates (latitude and longitude) based on location names in a specified column.
Note: ReverseGeo is an optional feature not enabled by default. Contact your account representative to add it to your workspace.
{
    "type": "reverseGeo",
    "deriveFrom": "locationColumnName"
}
Field       Required  Description
type        Yes       Must be "reverseGeo"
deriveFrom  Yes       Name of the column containing location names
This step will add a new column named {locationColumnName}_latlong containing the coordinates. Example output:
city      city_latlong
New York  40.7128,-74.0060
London    51.5074,-0.1278
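The output shape above can be sketched locally like this, with a hardcoded lookup standing in for the real geocoding service (the coordinates are taken from the example table; everything else is illustrative):

```python
# Stand-in for the geocoding backend, using the documented example values.
COORDS = {
    "New York": "40.7128,-74.0060",
    "London": "51.5074,-0.1278",
}

def reverse_geo(rows, derive_from):
    """Sketch of the reverseGeo step: add a {column}_latlong column."""
    col = f"{derive_from}_latlong"
    return [{**row, col: COORDS.get(row[derive_from])} for row in rows]

print(reverse_geo([{"city": "New York"}, {"city": "London"}], "city"))
```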

Writing effective helper prompts

The helperPrompt field is available on both the table extraction and normalize steps. It is the most impactful parameter for inference quality. A good prompt gives the LLM the context it needs to make accurate decisions. Be specific about the schema you expect. Name the columns and describe their format:
Extract the following fields: MemberName (full name without academic
titles like Dr. or Prof.), Address (street address), City, Country
(normalize all country names to English), Phone (include country code,
replacing a leading 0 with the appropriate code), Email, Website.
Describe the source format so the LLM knows how to parse it:
This text contains a semicolon-delimited table. Columns are separated
by commas and rows are separated by semicolons.
Call out edge cases and transformation rules:
If a phone number does not include a country code (+XX), add the
appropriate code based on the country. If latitude/longitude are
present, format them as "lat,long".
State what to include and exclude:
Extract all entries. Do not skip any. Omit personal honorifics like
Ms., Mr., Frau, Herr from names.

Choosing an LLM

Steps that use LLM inference (table extraction and normalize) accept optional model and provider fields to specify which model runs the step.
{
    "type": "table",
    "deriveFrom": "notes",
    "helperPrompt": "Extract product names and prices.",
    "model": "your-model-here",
    "provider": "openai"
}
Field     Required  Description
model     No        The model identifier
provider  No        The provider name (e.g., "openai", "ollama")
When omitted, DataLinks uses its default model. Smaller models are faster and cheaper for simple extractions; larger models handle ambiguous or complex text better.

Complete example

This example takes unstructured directory entries, extracts structured fields, normalizes the schema, and validates the results:
import requests

# Configuration
API_TOKEN = "YOUR_TOKEN"
NAMESPACE = "your_namespace"
DATASET = "dental_practices"

# Sample data: unstructured directory entries
data = [
    {"entry": "Dr. Alice Chen, Northwind Dental, 42 Oak St, Springfield, (555) 012-3456, alice@northwind.com"},
    {"entry": "Bob Martinez - Rivera Family Dentistry. 100 Main Ave, Portland OR 97201. bob@riveradental.com"}
]

# Inference pipeline configuration
inference_steps = {
    "steps": [
        # Step 1: Extract structured fields from free text
        {
            "type": "table",
            "deriveFrom": "entry",
            "helperPrompt": "Each entry is a dental practice listing. Extract the practitioner name (without Dr. or similar titles), practice name, street address, city, state, phone number (with +1 country code), and email."
        },
        # Step 2: Normalize to consistent schema
        {
            "type": "normalize",
            "mode": "all-in-one",
            "targetCols": {
                "practitioner": "Full name of the dentist, without titles",
                "practice_name": "Name of the dental practice",
                "address": "Street address",
                "city": "City name",
                "state": "US state name or abbreviation",
                "phone": "Phone number with +1 country code",
                "email": "Email address"
            },
            "helperPrompt": "These are US-based dental practices."
        },
        # Step 3: Validate email and phone fields
        {
            "type": "validate",
            "mode": "fields",
            "columns": ["email", "phone"]
        }
    ]
}

# Make the API request
response = requests.post(
    f"https://api.prod.datalinks.com/api/v1/ingest/{NAMESPACE}/{DATASET}",
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json"
    },
    json={
        "data": data,
        "infer": inference_steps
    }
)

# Check result
if response.status_code == 200:
    print("Ingestion successful!")
    print(response.json())
else:
    print(f"Error: {response.status_code}")
    print(response.text)

Saving inference definitions

When you create a dataset, you can save a data description and field definition that DataLinks uses during future ingestion calls. This avoids repeating the same configuration on every request. Set the definition at creation time via the inferDefinition field on Create new dataset, or update it later with Update infer definition.
{
    "dataDescription": "Dental practice directory entries with contact information.",
    "fieldDefinition": "practitioner=Full name without titles\npractice_name=Name of the dental practice\naddress=Street address\ncity=City\nstate=US state\nphone=Phone with country code\nemail=Email address"
}