When you ingest data into DataLinks, you can include an inference pipeline that transforms your data before it is stored. The pipeline is a sequence of steps that extract structure from unstructured text, normalize column names to a target schema, validate data quality, and enrich records with geographic coordinates.
Inference steps are specified in the infer.steps array of an ingestion request. Steps execute in order: the output of each step becomes the input for the next.
Step types at a glance
| Step | type value | Purpose |
|---|---|---|
| Table extraction | table | Extract structured columns from unstructured text |
| Rows | rows | Expand a JSON array stored in a column into rows |
| Normalize | normalize | Map extracted column names to a target schema |
| Validate | validate | Check data quality and flag invalid rows |
| ReverseGeo | reverseGeo | Add latitude/longitude from location names |
How the pipeline works
An inference pipeline can contain multiple steps. Steps execute sequentially in the order you list them. A typical pipeline follows this pattern:
Raw data
→ Table extraction (unstructured text → columns)
→ Normalize (inconsistent column names → target schema)
→ Validate (flag rows that fail quality checks)
→ Stored in DataLinks
You can use any combination of steps. A single step is fine. So is skipping straight to normalize if your data is already tabular but has inconsistent column names.
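As a minimal sketch of assembling a pipeline on the client side, the helper below checks each step's type against the values listed in the "Step types at a glance" table and preserves the order in which steps are passed. The name build_infer is hypothetical and not part of any DataLinks SDK; it only builds the infer object locally.

```python
# Step type values taken from the "Step types at a glance" table,
# plus the "normalise" spelling that the Normalize step also accepts.
KNOWN_STEP_TYPES = {"table", "rows", "normalize", "normalise", "validate", "reverseGeo"}


def build_infer(*steps):
    """Return an `infer` object whose steps keep the order they were passed in.

    Hypothetical local helper: it only checks the `type` field, nothing else.
    """
    for index, step in enumerate(steps):
        step_type = step.get("type")
        if step_type not in KNOWN_STEP_TYPES:
            raise ValueError(f"step {index} has unknown type {step_type!r}")
    return {"steps": list(steps)}


# Already-tabular data: skip table extraction and go straight to normalize.
infer = build_infer(
    {"type": "normalize", "mode": "all-in-one", "targetCols": {"email": "Email address"}},
    {"type": "validate", "mode": "fields", "columns": ["email"]},
)
```

The resulting dictionary can be passed as the infer value of an ingestion request body.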
The full request structure looks like this:
import requests
response = requests.post(
"https://api.datalinks.com/api/v1/ingest/namespace/datasetName",
headers={"Authorization": "Bearer YOUR_TOKEN"},
json={
"data": [
{"col1": 123, "col2": "foo", "colWithATable": "direction,size;up,large"},
{"col1": 444, "col2": "bar"}
],
"infer": {
"steps": [
{
"type": "table",
"deriveFrom": "colWithATable",
"helperPrompt": "This text contains a comma-separated table, but rows are separated by ';' instead of line breaks."
}
]
}
}
)
This example asks for a table to be extracted from one column, colWithATable. The step extracts every column it can find in that text. Use helperPrompt to describe the source format and improve extraction accuracy.
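To make the format described by the helper prompt concrete, here is a purely local illustration that parses the same sample string deterministically. It is not how DataLinks performs extraction (the table step uses an LLM, and its exact output is not specified on this page); parse_delimited_table is a hypothetical name used only for this sketch.

```python
def parse_delimited_table(text, col_sep=",", row_sep=";"):
    """Parse text whose first row is a header, e.g. 'direction,size;up,large'."""
    header, *rows = [line.split(col_sep) for line in text.split(row_sep)]
    # Pair each header name with the value in the same position of every row.
    return [dict(zip(header, row)) for row in rows]


extracted = parse_delimited_table("direction,size;up,large")
print(extracted)  # [{'direction': 'up', 'size': 'large'}]
```

The sample value yields one data row with two columns, direction and size, which is the structure the helper prompt asks the extraction step to recover.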
Step reference
Table extraction
The Table step extracts a table from any text input, including free-form text such as a restaurant review, a flight log report, or a financial instrument notice. Used on its own, this step may produce somewhat arbitrary column names that often require normalization afterward. It creates new columns from the extracted data and appends them to the rows of the existing table.
{
"type": "table",
"deriveFrom": "source_column",
"helperPrompt": "Optional prompt to guide extraction."
}
| Field | Required | Description |
|---|---|---|
| type | Yes | Must be "table" |
| deriveFrom | Yes | Name of the column containing unstructured text |
| helperPrompt | No | Instructions to the LLM about what to extract and how |
Pro tip: Calling an extraction with just the table step is a great way to see what kind of structured data can be generated from your unstructured data.
When processing diverse inputs, the LLM will invent column names based on what it finds. Different rows may produce different names for the same concept (e.g., phone_number vs telephone). Follow table extraction with a normalize step to consolidate these into a consistent schema.
Rows
If a column stores a table as JSON, the Rows step transforms it directly into structured rows. The JSON should be an array of objects; each key in the objects becomes a new column.
{
"type": "rows",
"deriveFrom": "json_column"
}
| Field | Required | Description |
|---|---|---|
| type | Yes | Must be "rows" |
| deriveFrom | Yes | Name of the column containing a JSON array |
For example, if a column contains [{"name": "Alice", "age": 34}, {"name": "Bob", "age": 28}], the Rows step produces two rows with name and age columns.
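The expansion can be pictured with a small local sketch. This is an illustration of the behavior described above, not DataLinks' implementation; expand_rows is a hypothetical name, and two details are assumptions that this page does not state: that the other columns of a record are copied onto every expanded row, and that records without the JSON column pass through unchanged.

```python
import json


def expand_rows(records, derive_from):
    """Expand a JSON array of objects stored in `derive_from` into one row per object."""
    expanded = []
    for record in records:
        value = record.get(derive_from)
        if value is None:
            # Assumption: records without the column are kept as they are.
            expanded.append(dict(record))
            continue
        # Accept either a JSON string or an already-decoded list.
        items = json.loads(value) if isinstance(value, str) else value
        # Assumption: remaining columns are repeated on each expanded row.
        rest = {k: v for k, v in record.items() if k != derive_from}
        for item in items:
            expanded.append({**rest, **item})
    return expanded


records = [{"id": 1, "people": '[{"name": "Alice", "age": 34}, {"name": "Bob", "age": 28}]'}]
rows = expand_rows(records, "people")
```

Here rows holds two entries, one for Alice and one for Bob, each with name and age columns.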
Normalize
Use the Normalize step to consolidate inconsistent column names into a single schema. In this step, specify the columns you want in the final table that is indexed into DataLinks.
{
"type": "normalize",
"mode": "all-in-one",
"targetCols": {
"full_name": "The person's full name, without titles like Dr. or Mr.",
"age": "Age as a number",
"company": "Company or organization name",
"email": "Email address"
},
"helperPrompt": "This data comes from professional directory listings."
}
| Field | Required | Description |
|---|---|---|
| type | Yes | Must be "normalize" (or "normalise") |
| mode | Yes | One of "all-in-one", "field-by-field", or "embeddings" |
| targetCols | Yes | Object mapping desired column names to descriptions |
| helperPrompt | No | Domain context to help the LLM interpret column names |
Choosing a mode
The normalize step supports three different modes:
| Mode | How it works | Best for |
|---|---|---|
| all-in-one | Uses a single prompt to normalize all columns at once | Good default for most cases. Fast and effective when columns are reasonably self-explanatory. |
| field-by-field | Normalizes each field individually, which can be more accurate for complex schemas | Complex schemas where columns may be ambiguous. More accurate but slower. |
| embeddings | Uses embeddings to match column names, which can be faster and more efficient for large datasets | High-volume ingestion with predictable, consistent source schemas. Fastest option. |
The descriptions you provide in targetCols matter. They are the primary signal the LLM uses to decide which extracted column maps to which target. Be specific: "Email address" is better than "email", and "Annual salary in USD" is better than "salary".
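A minimal sketch of a client-side builder for this step follows. It rejects modes outside the three documented values and flags descriptions that are empty or merely repeat the column name. The function name normalize_step and the "description equals column name" heuristic are assumptions made for illustration; DataLinks does not document such a check.

```python
NORMALIZE_MODES = {"all-in-one", "field-by-field", "embeddings"}


def normalize_step(target_cols, mode="all-in-one", helper_prompt=None):
    """Build a normalize step, refusing unknown modes and uninformative descriptions."""
    if mode not in NORMALIZE_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(NORMALIZE_MODES)}")
    # Heuristic (assumption): a description that is empty or identical to the
    # column name gives the model no extra signal, so treat it as too vague.
    vague = [
        name
        for name, description in target_cols.items()
        if not description.strip() or description.strip().lower() == name.lower()
    ]
    if vague:
        raise ValueError(f"descriptions too vague for: {', '.join(vague)}")
    step = {"type": "normalize", "mode": mode, "targetCols": dict(target_cols)}
    if helper_prompt:
        step["helperPrompt"] = helper_prompt
    return step


step = normalize_step(
    {"email": "Email address", "salary": "Annual salary in USD"},
    mode="field-by-field",
)
```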
Validate
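The Validate step checks the content of the columns you specify. It supports three validation modes (regex, rows, and fields), described under "Validation modes" below. In the template that follows, "regex|rows|fields" is a placeholder: set mode to exactly one of the three values.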
{
"type": "validate",
"mode": "regex|rows|fields",
"columns": ["column1", "column2"]
}
| Field | Required | Description |
|---|---|---|
| type | Yes | Must be "validate" |
| mode | Yes | One of "regex", "rows", or "fields" |
| columns | Yes | Array of column names to validate |
Validation modes
- regex: Validates columns using regular expressions
- rows: Validates entire rows based on the specified columns
- fields: Validates individual fields in the specified columns
Validated rows will include a __valid field indicating whether the validation passed.
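Once you have validated rows in hand, separating passes from failures is straightforward. The sketch below assumes that rows are available as a list of dictionaries and that __valid holds a boolean-like value; neither the retrieval method nor the exact type of __valid is specified on this page, and split_by_validity is a hypothetical helper.

```python
def split_by_validity(rows):
    """Split rows into (valid, invalid) using the __valid flag.

    Assumption: __valid is truthy when validation passed. Rows without
    the flag are treated as invalid so they are not silently accepted.
    """
    valid = [row for row in rows if row.get("__valid")]
    invalid = [row for row in rows if not row.get("__valid")]
    return valid, invalid


sample_rows = [
    {"email": "alice@northwind.com", "__valid": True},
    {"email": "not-an-email", "__valid": False},
]
valid_rows, invalid_rows = split_by_validity(sample_rows)
```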
ReverseGeo
The ReverseGeo step adds geographical coordinates (latitude and longitude) based on location names in a specified column.
Note: ReverseGeo is an optional feature not enabled by default. Contact your account representative to add it to your workspace.
{
"type": "reverseGeo",
"deriveFrom": "locationColumnName"
}
| Field | Required | Description |
|---|---|---|
| type | Yes | Must be "reverseGeo" |
| deriveFrom | Yes | Name of the column containing location names |
This step will add a new column named {locationColumnName}_latlong containing the coordinates.
Example output:
| city | city_latlong |
|---|---|
| New York | 40.7128,-74.0060 |
| London | 51.5074,-0.1278 |
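Downstream code usually needs the coordinates as numbers instead of a single string. The following sketch parses a value in the "lat,long" form shown in the example output; parse_latlong is a hypothetical helper, and the assumption that every value follows exactly this comma-separated form is based only on the two sample rows above.

```python
def parse_latlong(value):
    """Parse a 'lat,long' string such as '40.7128,-74.0060' into two floats."""
    lat_text, lon_text = value.split(",")
    lat, lon = float(lat_text), float(lon_text)
    # Basic sanity check on the coordinate ranges.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError(f"coordinates out of range: {value!r}")
    return lat, lon


new_york = parse_latlong("40.7128,-74.0060")
london = parse_latlong("51.5074,-0.1278")
```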
Writing effective helper prompts
The helperPrompt field is available on both the table extraction and normalize steps. It is the most impactful parameter for inference quality. A good prompt gives the LLM the context it needs to make accurate decisions.
Be specific about the schema you expect. Name the columns and describe their format:
Extract the following fields: MemberName (full name without academic
titles like Dr. or Prof.), Address (street address), City, Country
(normalize all country names to English), Phone (include country code,
replacing a leading 0 with the appropriate code), Email, Website.
Describe the source format so the LLM knows how to parse it:
This text contains a semicolon-delimited table. Columns are separated
by commas and rows are separated by semicolons.
Call out edge cases and transformation rules:
If a phone number does not include a country code (+XX), add the
appropriate code based on the country. If latitude/longitude are
present, format them as "lat,long".
State what to include and exclude:
Extract all entries. Do not skip any. Omit personal honorifics like
Ms., Mr., Frau, Herr from names.
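If you assemble prompts programmatically, the four kinds of guidance above can be combined into one string. This is only a sketch of one possible way to do that; build_helper_prompt and its parameters are hypothetical, and DataLinks does not require any particular prompt layout.

```python
def build_helper_prompt(fields, source_format=None, rules=()):
    """Combine a source-format note, a field list, and extra rules into one prompt.

    `fields` maps a column name to an optional format hint (empty string for none).
    """
    parts = []
    if source_format:
        parts.append(source_format.strip())
    field_text = ", ".join(
        f"{name} ({hint})" if hint else name for name, hint in fields.items()
    )
    parts.append(f"Extract the following fields: {field_text}.")
    # Edge cases and include/exclude statements go last.
    parts.extend(rule.strip() for rule in rules)
    return " ".join(parts)


prompt = build_helper_prompt(
    {"City": "", "Phone": "include country code"},
    source_format="Rows are separated by semicolons.",
    rules=["Extract all entries. Do not skip any."],
)
```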
Choosing an LLM
Steps that use LLM inference (table extraction and normalize) accept optional model and provider fields to specify which model runs the step.
{
"type": "table",
"deriveFrom": "notes",
"helperPrompt": "Extract product names and prices.",
"model": "your-model-here",
"provider": "openai"
}
| Field | Required | Description |
|---|---|---|
| model | No | The model identifier |
| provider | No | The provider name (e.g., "openai", "ollama") |
When omitted, DataLinks uses its default model. Smaller models are faster and cheaper for simple extractions; larger models handle ambiguous or complex text better.
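A small sketch of attaching these optional fields only when they are set, so that omitted values fall back to the default model. The helper with_model is hypothetical; it restricts itself to the two step types that this page says use LLM inference.

```python
# Step types that accept model/provider according to this page.
LLM_STEP_TYPES = {"table", "normalize", "normalise"}


def with_model(step, model=None, provider=None):
    """Return a copy of `step` with model/provider added when provided."""
    if step.get("type") not in LLM_STEP_TYPES:
        raise ValueError(f"step type {step.get('type')!r} does not use LLM inference")
    updated = dict(step)  # leave the caller's step untouched
    if model:
        updated["model"] = model
    if provider:
        updated["provider"] = provider
    return updated


base_step = {"type": "table", "deriveFrom": "notes"}
custom_step = with_model(base_step, model="your-model-here", provider="openai")
default_step = with_model(base_step)  # no overrides: DataLinks default applies
```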
Complete example
This example takes unstructured directory entries, extracts structured fields, normalizes the schema, and validates the results:
import requests
# Configuration
API_TOKEN = "YOUR_TOKEN"
NAMESPACE = "your_namespace"
DATASET = "dental_practices"
# Sample data: unstructured directory entries
data = [
{"entry": "Dr. Alice Chen, Northwind Dental, 42 Oak St, Springfield, (555) 012-3456, alice@northwind.com"},
{"entry": "Bob Martinez - Rivera Family Dentistry. 100 Main Ave, Portland OR 97201. bob@riveradental.com"}
]
# Inference pipeline configuration
inference_steps = {
"steps": [
# Step 1: Extract structured fields from free text
{
"type": "table",
"deriveFrom": "entry",
"helperPrompt": "Each entry is a dental practice listing. Extract the practitioner name (without Dr. or similar titles), practice name, street address, city, state, phone number (with +1 country code), and email."
},
# Step 2: Normalize to consistent schema
{
"type": "normalize",
"mode": "all-in-one",
"targetCols": {
"practitioner": "Full name of the dentist, without titles",
"practice_name": "Name of the dental practice",
"address": "Street address",
"city": "City name",
"state": "US state name or abbreviation",
"phone": "Phone number with +1 country code",
"email": "Email address"
},
"helperPrompt": "These are US-based dental practices."
},
# Step 3: Validate email and phone fields
{
"type": "validate",
"mode": "fields",
"columns": ["email", "phone"]
}
]
}
# Make the API request
response = requests.post(
f"https://api.datalinks.com/api/v1/ingest/{NAMESPACE}/{DATASET}",
headers={
"Authorization": f"Bearer {API_TOKEN}",
"Content-Type": "application/json"
},
json={
"data": data,
"infer": inference_steps
}
)
# Check result
if response.status_code == 200:
    print("Ingestion successful!")
    print(response.json())
else:
    print(f"Error: {response.status_code}")
    print(response.text)
Saving inference definitions
When you create a dataset, you can save a data description and field definition that DataLinks uses during future ingestion calls. This avoids repeating the same configuration on every request.
Set the definition at creation time via the inferDefinition field on Create new dataset, or update it later with Update infer definition.
{
"dataDescription": "Dental practice directory entries with contact information.",
"fieldDefinition": "practitioner=Full name without titles\npractice_name=Name of the dental practice\naddress=Street address\ncity=City\nstate=US state\nphone=Phone with country code\nemail=Email address"
}
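The fieldDefinition value in the example above appears to be one name=description pair per line. Assuming that format (it is inferred from the example, not from a formal specification), a definition can be generated from a dictionary, for instance by reusing the targetCols of a normalize step. The helper build_infer_definition is hypothetical.

```python
def build_infer_definition(data_description, fields):
    """Build an inferDefinition body from a {column: description} mapping.

    Assumption: fieldDefinition is newline-separated 'name=description' pairs,
    as in the example above.
    """
    field_definition = "\n".join(
        f"{name}={description}" for name, description in fields.items()
    )
    return {"dataDescription": data_description, "fieldDefinition": field_definition}


definition = build_infer_definition(
    "Dental practice directory entries with contact information.",
    {
        "practitioner": "Full name without titles",
        "email": "Email address",
    },
)
```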