datalinks package

Subpackages

Submodules

Bases: object

Command-line interface (CLI) wrapper for customizable argument parsing. It simplifies creating and using the DataLinks CLI by letting you pass a custom callback function that adds application-specific arguments. It provides a standard set of CLI arguments and supports customization through user-defined argument groups.
  • Variables:
    • name – The name of the CLI program.
    • description – The description of the CLI program.
  • Return type: Namespace
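
The callback pattern described above can be sketched with the standard library's argparse. This is a minimal illustration, not the DataLinks implementation: the class name CliSketch, the parse() method, the --verbose flag, and the callback signature (receiving the parser) are all assumptions made for the example.

```python
import argparse
from typing import Callable, Optional, Sequence


class CliSketch:
    """Hypothetical stand-in for the CLI wrapper; not the DataLinks class."""

    def __init__(self, name: str, description: str) -> None:
        self.name = name
        self.description = description

    def parse(
        self,
        add_custom_args: Optional[Callable[[argparse.ArgumentParser], None]] = None,
        argv: Optional[Sequence[str]] = None,
    ) -> argparse.Namespace:
        parser = argparse.ArgumentParser(prog=self.name, description=self.description)
        # A "standard" argument every application receives (illustrative only).
        parser.add_argument("--verbose", action="store_true")
        # Let the application register its own arguments through the callback.
        if add_custom_args is not None:
            add_custom_args(parser)
        return parser.parse_args(argv)


def add_app_args(parser: argparse.ArgumentParser) -> None:
    # User-defined group holding application-specific options.
    group = parser.add_argument_group("application options")
    group.add_argument("--input-folder", default="data")


args = CliSketch("demo", "Demo CLI").parse(add_app_args, ["--input-folder", "in"])
```

Here args is an argparse.Namespace that carries both the standard and the application-specific values, which matches the Namespace return type listed above.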

Bases: object

The base type for entity resolution operators.
  • Variables: targetColumns – A list of column names that are the target for matching. If None, all columns are used for entity resolution.

targetColumns : List[str] | None = None

Bases: MatchType

Use this match type to configure exact-matching criteria for the data values.
  • Variables:
    • minVariation – Minimum allowable variation in the field to check for matches (defaults to 0.0).
    • minDistinct – Minimum percentage of distinct values in the field to check for matches (defaults to 0.0).

minVariation : float | None = None

minDistinct : float | None = None

Bases: MatchType

Use this match type to check for matches in fields that represent geographical attributes.
  • Variables:
    • distance – The maximum distance value for the geographical match (defaults to 2.0).
    • distanceUnit – The unit of measurement for the distance, such as kilometers or miles (defaults to ‘kilometers’).

distance : float | None = None

distanceUnit : str | None = None

Bases: StrEnum

Enumerates the resolution strategies for matching or reconciling entity data. Each value specifies the method used to determine whether entities are equivalent.
  • Variables:
    • ExactMatch – Used when entities are determined to be equivalent based on exact value matches without any approximation.
    • GeoMatch – Used when entities are matched based on their geographical location or proximity.

ExactMatch = 'ExactMatch'

GeoMatch = 'GeoMatch'

Bases: object

Encapsulates the configuration for the different entity resolution match types, such as ExactMatch and GeoMatch. It keeps the state for each match type and exposes a consolidated configuration as a dictionary.
  • Variables: matchTypes – A dictionary mapping entity resolution types to their respective match configurations (e.g., ExactMatch, GeoMatch).

matchTypes : dict[EntityResolutionTypes, MatchType | None]

property config : Dict[str, Dict] | None
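
As a rough illustration of how the match types and the consolidated config property fit together, the sketch below rebuilds the documented fields with plain dataclasses. The container name MatchConfigSketch and the rule that unset (None) entries are omitted from the dictionary are assumptions; the real class may shape its output differently.

```python
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class EntityResolutionTypes(str, Enum):
    # The reference lists StrEnum as the base; (str, Enum) is used here so the
    # sketch also runs on Python versions older than 3.11.
    ExactMatch = "ExactMatch"
    GeoMatch = "GeoMatch"


@dataclass
class MatchType:
    targetColumns: Optional[List[str]] = None


@dataclass
class ExactMatch(MatchType):
    minVariation: Optional[float] = None
    minDistinct: Optional[float] = None


@dataclass
class GeoMatch(MatchType):
    distance: Optional[float] = None
    distanceUnit: Optional[str] = None


@dataclass
class MatchConfigSketch:
    """Hypothetical container; the real class name and logic may differ."""

    matchTypes: Dict[EntityResolutionTypes, Optional[MatchType]] = field(
        default_factory=dict
    )

    @property
    def config(self) -> Optional[Dict[str, Dict]]:
        # Assumption: unset match types and unset (None) fields are omitted.
        result: Dict[str, Dict] = {}
        for kind, match in self.matchTypes.items():
            if match is None:
                continue
            result[kind.value] = {
                key: value for key, value in asdict(match).items() if value is not None
            }
        return result or None


match_config = MatchConfigSketch(
    {
        EntityResolutionTypes.ExactMatch: ExactMatch(targetColumns=["name"], minDistinct=0.5),
        EntityResolutionTypes.GeoMatch: GeoMatch(distance=2.0, distanceUnit="kilometers"),
    }
).config
```

With no match types configured, this sketch returns None, in line with the Dict[str, Dict] | None annotation above.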

Bases: ABC

Abstract base class for loading resources from a specified folder. It serves as a template for loading files or other resources while maintaining consistency across different implementations.
  • Variables: folder – Path to the folder from which resources will be loaded.

abstractmethod load_from_folder()

Bases: Loader

A loader for processing JSON files in a specified folder. It iterates through all .json files within the folder, parses their content, and converts each JSON object into a standardized format using the load_item method.
  • Variables: folder – Path to the folder containing JSON files. All .json files in this folder will be processed.

load_from_folder()

  • Return type: List[Dict[str, str]]

abstractmethod load_item(row)

  • Return type: Dict[str, str]
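
The loader contract above (an abstract load_from_folder, plus a JSON loader that delegates each object to an abstract load_item) might look roughly like the following. The names JsonLoaderSketch and ArticleLoader, the constructor signature, and the handling of files that contain a list of objects are assumptions made for illustration.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, List


class Loader(ABC):
    def __init__(self, folder: str) -> None:
        self.folder = folder

    @abstractmethod
    def load_from_folder(self):
        ...


class JsonLoaderSketch(Loader):
    """Hypothetical stand-in for the JSON loader described above."""

    def load_from_folder(self) -> List[Dict[str, str]]:
        items: List[Dict[str, str]] = []
        # Sorted so the result order is deterministic across platforms.
        for path in sorted(Path(self.folder).glob("*.json")):
            with path.open(encoding="utf-8") as handle:
                data = json.load(handle)
            # Assumption: a file holds either one object or a list of objects.
            rows = data if isinstance(data, list) else [data]
            items.extend(self.load_item(row) for row in rows)
        return items

    @abstractmethod
    def load_item(self, row) -> Dict[str, str]:
        ...


class ArticleLoader(JsonLoaderSketch):
    def load_item(self, row) -> Dict[str, str]:
        # Map an application-specific JSON object to a flat string dictionary.
        return {"title": str(row["title"]), "text": str(row["body"])}


with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.json").write_text(
        json.dumps([{"title": "T1", "body": "B1"}]), encoding="utf-8"
    )
    Path(folder, "b.json").write_text(
        json.dumps({"title": "T2", "body": "B2"}), encoding="utf-8"
    )
    loaded = ArticleLoader(folder).load_from_folder()
```

Only load_item needs to be written per application; the folder traversal and parsing stay in the shared loader.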

Bases: Enum

Represents the types of processing steps available in DataLinks workflows. Each value identifies a specific stage in the data-processing pipeline.

TABLE = 'table'

ROWS = 'rows'

NORMALIZE = 'normalise'

VALIDATE = 'validate'

REVERSE_GEO = 'reverseGeo'

Bases: Enum

Enumeration of the normalization modes used in the ‘normalise’ step. It provides three options: ‘embeddings’ for embedding-level normalization, ‘all-in-one’ for holistic normalization, and ‘field-by-field’ for column-wise normalization.
  • Variables:
    • EMBEDDINGS – Mode for normalizing data on an embedding level.
    • ALL_IN_ONE – Mode for normalizing data holistically, treating the entire dataset as a single entity.
    • FIELD_BY_FIELD – Mode for normalizing data column-by-column, focusing on individual fields independently.

EMBEDDINGS = 'embeddings'

ALL_IN_ONE = 'all-in-one'

FIELD_BY_FIELD = 'field-by-field'

Bases: Enum

Enumeration of the validation modes available for the ‘validate’ step. The predefined modes cover validation by rows, by regular expressions, and by fields.
  • Variables:
    • ROWS – Validation mode that focuses on rows.
    • REGEX – Validation mode that utilizes regular expressions.
    • FIELDS – Validation mode that focuses on columns.

ROWS = 'rows'

REGEX = 'regex'

FIELDS = 'fields'

Bases: object

Represents the base step within DataLinks. This class is the foundation for the concrete step implementations. It includes a method that converts the step into a dictionary, with custom handling for attributes of Enum type. It is primarily designed as a base class to be subclassed rather than used directly.
  • Variables: step_type – The type of the step, categorized using StepTypes.

step_type : ClassVar[StepTypes]

to_dict()

  • Return type: dict
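
One plausible reading of the Enum handling in to_dict() is that Enum attributes are replaced by their values so the result is serializable. The sketch below shows that idea only; the class names BaseStepSketch and NormalizeSketch, the "type" key, and the exact conversion rule are assumptions, not the library's documented behaviour.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import Any, ClassVar, Dict


class StepTypes(Enum):
    TABLE = "table"
    NORMALIZE = "normalise"


class NormalizeModes(Enum):
    EMBEDDINGS = "embeddings"
    ALL_IN_ONE = "all-in-one"
    FIELD_BY_FIELD = "field-by-field"


@dataclass
class BaseStepSketch:
    """Hypothetical base step; the real to_dict() output may be shaped differently."""

    step_type: ClassVar[StepTypes]

    def to_dict(self) -> Dict[str, Any]:
        # Assumption: the class-level step type is emitted under a "type" key.
        result: Dict[str, Any] = {"type": self.step_type.value}
        # fields() skips ClassVar attributes, so only instance fields are listed.
        for item in fields(self):
            value = getattr(self, item.name)
            # Enum members are replaced by their underlying value.
            result[item.name] = value.value if isinstance(value, Enum) else value
        return result


@dataclass
class NormalizeSketch(BaseStepSketch):
    step_type: ClassVar[StepTypes] = StepTypes.NORMALIZE
    mode: NormalizeModes = NormalizeModes.ALL_IN_ONE
    helper_prompt: str = ""


step_dict = NormalizeSketch(mode=NormalizeModes.FIELD_BY_FIELD).to_dict()
```

Declaring step_type as a ClassVar keeps it out of the constructor, so each subclass fixes its own step type once.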

Bases: BaseStep

Common class for pipeline steps that rely on LLM inference.
  • Variables:
    • model – The name of the model to use in the step.
    • provider – The identifier of the provider to use (for example, ollama or openai).

model : str | None

provider : str | None

Bases: BaseStep

Represents the ‘infer’ step in the DataLinks workflow.
  • Variables: derive_from – The identifier of the source field used in the inference step.

derive_from : str

Bases: LlmStep, InferenceStep

Use this step to infer a table from unstructured data.
  • Variables: helper_prompt – A string that stores an optional helper prompt or additional guiding context specific to the table inference step.

step_type : ClassVar[StepTypes] = 'table'

helper_prompt : str = ''

Bases: InferenceStep

Use this step to extract data that is already in tabular format (e.g., CSV).

step_type : ClassVar[StepTypes] = 'rows'

Bases: InferenceStep

Use this step to perform reverse geolocation based on the source field.

step_type : ClassVar[StepTypes] = 'reverseGeo'

Bases: LlmStep

Use this step to attempt normalisation of the extracted column names. Table inference across different unstructured data blocks may produce different field names for the same information, so the column names need to be normalised. The class encapsulates the configuration for the ‘normalise’ step: the desired target columns, the normalisation mode, and an optional helper prompt that provides further instructions or context.
  • Variables:
    • target_cols – A mapping of the desired column names to an optional description used as context.
    • mode – Specifies the normalisation mode to be applied.
    • helper_prompt – Optional helper text or prompt information.

step_type : ClassVar[StepTypes] = 'normalise'

target_cols : Mapping[str, str | None]

mode : NormalizeModes

helper_prompt : str = ''

Bases: LlmStep

Use this step to add data validation to the inference pipeline.
  • Variables:
    • mode – Indicates the mode of validation to be applied.
    • columns – List containing the column names which are used for validation.

step_type : ClassVar[StepTypes] = 'validate'

mode : ValidateModes

columns : List[str]

Bases: object

Represents a collection of sequential steps. It holds and manages the sequence of steps used for ingesting and/or enhancing data.
  • Variables: steps – A collection of steps to be executed in sequence.

to_list()

  • Return type: list[dict[str, Any]]
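
To show how a sequence of steps could be flattened by to_list(), here is a small self-contained sketch. PipelineSketch, RowsStepSketch, and ValidateStepSketch are hypothetical stand-ins, and the dictionary keys ("type", "derive_from", "mode", "columns") are assumed for illustration rather than taken from the DataLinks wire format.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class RowsStepSketch:
    derive_from: str

    def to_dict(self) -> Dict[str, Any]:
        return {"type": "rows", "derive_from": self.derive_from}


@dataclass
class ValidateStepSketch:
    mode: str
    columns: List[str]

    def to_dict(self) -> Dict[str, Any]:
        return {"type": "validate", "mode": self.mode, "columns": self.columns}


class PipelineSketch:
    """Hypothetical container that keeps steps in execution order."""

    def __init__(self, *steps: Any) -> None:
        self.steps = list(steps)

    def to_list(self) -> List[Dict[str, Any]]:
        # Serialize each step in the order it was added.
        return [step.to_dict() for step in self.steps]


payload = PipelineSketch(
    RowsStepSketch(derive_from="raw"),
    ValidateStepSketch(mode="fields", columns=["name"]),
).to_list()
```

Because the list preserves insertion order, the serialized payload reflects the order in which the steps are meant to run.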