Introduction

When uploading data, you can derive new columns from existing ones using our AI for automatic derivations. To do this, specify the columns to derive from and the steps you want to run. We support several steps, including Table, Rows, Normalize, Validate, and ReverseGeo, as well as column inference.

# The payload is read from a quoted heredoc so the single quotes inside helperPrompt do not end the shell string.
curl -H "Authorization: Bearer token1234abcdefg" \
     -X POST https://api.datalinks.com/api/ingest/namespace/datasetName \
     --data @- <<'EOF'
{
  "data": [{"col1": 123, "col2": "foo", "colWithATable": "direction,size;up,large"}, {"col1": 444, "col2": "bar"}],
  "infer": {
    "steps": [{
      "type": "table",
      "deriveFrom": "colWithATable",
      "helperPrompt": "This text contains a table comma separated, but instead of line breaks we are using a ';'."
    }]
  }
}
EOF
  • The inference can have multiple steps, which are combined by our AI. The example above uses a single table step, but you can also add rows, normalize, validate, and reverseGeo steps.

  • In this example, we request a table extracted from one column (colWithATable). This extracts every column found in the provided text. Use helperPrompt to guide the AI and improve extraction accuracy.
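
For readers calling the API from code, the request shown above can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: build_ingest_request is a hypothetical helper name, and the endpoint, header, and payload are copied from the curl example. The request is only built here, not sent.

```python
import json
import urllib.request


def build_ingest_request(token, namespace, dataset, rows, steps):
    """Hypothetical helper: builds (but does not send) the ingest request."""
    url = f"https://api.datalinks.com/api/ingest/{namespace}/{dataset}"
    payload = {"data": rows, "infer": {"steps": steps}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )


request = build_ingest_request(
    token="token1234abcdefg",
    namespace="namespace",
    dataset="datasetName",
    rows=[
        {"col1": 123, "col2": "foo", "colWithATable": "direction,size;up,large"},
        {"col1": 444, "col2": "bar"},
    ],
    steps=[{
        "type": "table",
        "deriveFrom": "colWithATable",
        "helperPrompt": "This text contains a table comma separated, "
                        "but instead of line breaks we are using a ';'.",
    }],
)
# To actually send it: urllib.request.urlopen(request)
```

Building the payload with json.dumps avoids the shell-quoting pitfalls of embedding JSON in a command line.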

Available steps

Table extraction

The Table step extracts a table from any text input, including free-form text such as a restaurant review, a flight log report, or a financial instrument notice. When used on its own, this step may produce columns that are somewhat arbitrary and often require normalization afterward. It works by extracting data into new columns and appending them to the rows of the existing table.

Pro tip: Calling an extraction with just the table step is a great way to see what kind of structured data can be generated from your unstructured data.

{
  "type": "table",
  "deriveFrom": "name of the column the table should be derived from",
  "helperPrompt": "This text contains a table comma separated, but instead of line breaks we are using a ';'."
}
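
To make the idea of "new columns appended to the existing row" concrete, the sketch below parses the sample value from the introduction locally. It is only an illustration of the expected shape: the service uses AI rather than a fixed parser, parse_delimited_table is a hypothetical helper, and whether the source column is kept in the result is an assumption.

```python
def parse_delimited_table(text, row_sep=";", col_sep=","):
    """Split a flat 'header;row;row' string into a list of dicts."""
    header, *records = [line.split(col_sep) for line in text.split(row_sep)]
    return [dict(zip(header, record)) for record in records]


row = {"col1": 123, "col2": "foo", "colWithATable": "direction,size;up,large"}

# One extracted record per table row found in the text.
extracted = parse_delimited_table(row["colWithATable"])

# Assumed result shape: the original columns plus the newly derived ones.
derived_rows = [{**row, **record} for record in extracted]
```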

Rows

If a table is stored in JSON format within a column, you can directly transform it into a structured table. The JSON should consist of an array of objects, with each key in the objects mapped to a new column.

{
  "type": "rows",
  "deriveFrom": "name of the column the JSON content will be read from"
}
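
Because the step expects an array of objects, it can help to check a column value before uploading. The following is a minimal local sketch, assuming the column holds a JSON string; load_json_rows and the sample data are hypothetical and not part of the API.

```python
import json


def load_json_rows(row, derive_from):
    """Parse the column and check that it is a JSON array of objects."""
    records = json.loads(row[derive_from])
    if not isinstance(records, list) or not all(isinstance(r, dict) for r in records):
        raise ValueError(f"{derive_from!r} must contain a JSON array of objects")
    return records


# Made-up sample row for illustration only.
row = {"id": 1, "items": '[{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]'}
records = load_json_rows(row, "items")

# Each key that appears in the objects corresponds to a new column.
new_columns = sorted({key for record in records for key in record})
```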

Normalize names

Use the Normalize Names step to consolidate the schema or column space. In this step, specify the columns you want to include in the final table that will be indexed into DataLinks.

{
  "type": "normalize",
  "mode": "all-in-one",
  "targetCols": {
      "NameOfTheColumn": "Description of what this column is. It enables the AI to automatically map other names onto the desired schema. You can define as many columns as you want."
  },
  "helperPrompt": "This prompt helps the LLM understand the domain you are operating in, so that it can correctly interpret the existing column names and the target columns."
}

The normalize step supports three different modes:

  1. all-in-one: Uses a single prompt to normalize all columns at once
  2. field-by-field: Normalizes each field individually, which can be more accurate for complex schemas
  3. embeddings: Uses embeddings to match column names, which can be faster and more efficient for large datasets
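
A small builder can catch a mistyped mode before the request is sent. This is a sketch under assumptions: normalize_step is a hypothetical helper, its default mode is a choice made here rather than a documented API default, and treating helperPrompt as optional is also an assumption.

```python
NORMALIZE_MODES = ("all-in-one", "field-by-field", "embeddings")


def normalize_step(target_cols, mode="all-in-one", helper_prompt=None):
    """Build a normalize step, rejecting modes not listed in the docs."""
    if mode not in NORMALIZE_MODES:
        raise ValueError(f"mode must be one of {NORMALIZE_MODES}, got {mode!r}")
    step = {"type": "normalize", "mode": mode, "targetCols": dict(target_cols)}
    if helper_prompt:  # assumption: the prompt may be omitted
        step["helperPrompt"] = helper_prompt
    return step


# Made-up target schema for illustration only.
step = normalize_step(
    {"Airline": "Name of the carrier operating the flight"},
    mode="embeddings",
)
```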

Validate

The Validate step ensures data quality by checking the content of the specified columns.

{
  "type": "validate",
  "mode": "regex|rows|fields",
  "columns": ["column1", "column2"]
}

Set mode to one of three values:

  1. regex: Validates columns using regular expressions
  2. rows: Validates entire rows based on the specified columns
  3. fields: Validates individual fields in the specified columns

Validated rows will include a __valid field indicating whether the validation passed.
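
Once validated rows are retrieved, the __valid field can be used to separate them. The sketch below assumes rows arrive as a list of dictionaries and that __valid is a boolean; both are assumptions, and the sample rows are made up for illustration.

```python
def split_by_validity(rows):
    """Partition rows on __valid; rows without the flag count as invalid."""
    valid = [r for r in rows if r.get("__valid") is True]
    invalid = [r for r in rows if r.get("__valid") is not True]
    return valid, invalid


# Made-up rows for illustration only.
rows = [
    {"email": "a@example.com", "__valid": True},
    {"email": "not-an-email", "__valid": False},
    {"email": "b@example.com"},  # flag missing
]
valid, invalid = split_by_validity(rows)
```

Treating a missing flag as invalid is a conservative choice made for this sketch; adjust it if unflagged rows should pass through.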

ReverseGeo

The ReverseGeo step adds geographical coordinates (latitude and longitude) based on location names in a specified column.

{
  "type": "reverseGeo",
  "deriveFrom": "locationColumnName"
}

This step will add a new column named {locationColumnName}_latlong containing the coordinates.
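
The naming rule for the derived column can be expressed as a short sketch. Both helpers are hypothetical conveniences; only the step shape and the _latlong suffix come from the text above.

```python
def reverse_geo_step(location_column):
    """Build a reverseGeo step for the given location column."""
    return {"type": "reverseGeo", "deriveFrom": location_column}


def latlong_column(location_column):
    """Name of the column the step is documented to add."""
    return f"{location_column}_latlong"


step = reverse_geo_step("city")      # "city" is a made-up column name
coords_column = latlong_column("city")
```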