Base package to build indexing scripts for DataLinks

DataLinks Python SDK

Overview

The DataLinks Python SDK simplifies data ingestion, normalization, linking, and querying with DataLinks. It integrates with the DataLinks API to manage data workflows, including entity resolution and inference steps, and is configured through environment variables or a .env file.

The SDK wraps the API in a Pythonic interface and supports flexible chaining of inference and validation steps, which speeds up the development of applications built on DataLinks.


Features

  • Ingestion API: Easily ingest data into namespaces with built-in batching and retry mechanisms.
  • Inference Workflow Management: Define custom chains of inference and validation steps.
  • Entity Resolution: Match entities using configurable exact or geo-based matching methods.
  • Namespace Management: Create and manage namespaces with privacy options.
  • Data Querying: Query data with options to include/exclude metadata.
  • Custom Loaders: Load custom data formats like JSON into defined workflows.
  • CLI Tool: Standardized command-line interface for managing ingestion pipelines quickly.
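The batching and retry behaviour mentioned under the Ingestion API can be pictured with a small standard-library sketch. It is not the SDK's implementation: `batched`, `send_with_retry`, and their parameters are hypothetical names chosen for illustration, and the real batch sizes and retry policy may differ.

```python
import time
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists holding at most `size` rows each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def send_with_retry(
    send: Callable[[List[T]], None],
    batch: List[T],
    attempts: int = 3,
    base_delay: float = 0.5,
) -> int:
    """Call `send(batch)`, retrying on any exception with exponential backoff.

    Returns the number of attempts used and re-raises after the last failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            send(batch)
            return attempt
        except Exception:
            if attempt == attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next try.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

In this sketch `send` stands in for whatever function actually uploads one batch. Keeping it as a parameter lets the batching and retry logic be tested without a network connection.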

Installation

To install the SDK, simply use pip:

pip install datalinks

To install the package in editable (development) mode:

  1. Clone the repository from your version-control system.
  2. Create a virtual environment with your tool/distro of choice.
  3. Run the following:
pip install -e .

Quick Start

Here’s how to get started with the DataLinks SDK:

  1. Configuration

    Ensure the required environment variables for the DataLinks API are set:

    • HOST
    • DL_API_KEY
    • INDEX
    • NAMESPACE
    • OBJECT_NAME (optional)

    Alternatively, you can use a .env file in the root of your project for configuration.

  2. Basic Example

    Import the SDK and initialize the configuration:

from datalinks.api import DataLinksAPI, DLConfig

# Initialize configuration
config = DLConfig.from_env()

# Instantiate API client
client = DataLinksAPI(config=config)

# Query data
data = client.query_data(query="*", include_metadata=False)
print(data)
  3. CLI Usage

    The SDK also provides a built-in CLI that can be extended:

datalinks-client [-h] --verbose <input-folder>

Components

1. DLConfig

DLConfig reads configuration values (e.g., API keys) from environment variables or a .env file, so the same code can run unchanged across deployment environments.
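Before constructing a configuration, it can be useful to check that the variables listed in the Quick Start are present. The following is a minimal standard-library sketch of such a check. It does not show how DLConfig works internally, and `read_settings` is a hypothetical helper, not part of the SDK.

```python
import os
from typing import Dict, Mapping, Optional

# Variable names taken from the Quick Start section.
REQUIRED = ("HOST", "DL_API_KEY", "INDEX", "NAMESPACE")
OPTIONAL = ("OBJECT_NAME",)


def read_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Collect the DataLinks settings from `env` (defaults to os.environ).

    Raises KeyError naming every required variable that is unset or empty.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    settings: Dict[str, Optional[str]] = {name: env[name] for name in REQUIRED}
    for name in OPTIONAL:
        settings[name] = env.get(name)  # None when the optional value is absent
    return settings
```

Reporting all missing names at once, rather than failing on the first, saves a round trip when several variables are absent in a new environment.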

2. DataLinksAPI

DataLinksAPI handles interactions with the API. You can:

  • Ingest data.
  • Query or retrieve data with complex parameters.
  • Manage namespaces.

3. Inference Workflow

Use a chain of inference and validation steps defined through classes like TableStep, NormaliseStep, and ValidateStep to automate data preparation workflows.

from datalinks.chain import Chain, TableStep, NormaliseStep, ValidateStep, ValidateModes

# Define an inference chain
inference_steps = Chain(
    TableStep(derive_from="source_field", helper_prompt="This extracts tables."),
    NormaliseStep(target_cols={"email": "email_address"}, mode="all-in-one"),
    ValidateStep(mode=ValidateModes.FIELDS, columns=["email", "phone"]),
)
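The idea of a chain, where each step receives the output of the previous one, can be sketched locally with plain functions. This is only an illustration of ordered composition. It does not reproduce what TableStep, NormaliseStep, or ValidateStep do, and `rename_columns`, `require_columns`, and `run_chain` are hypothetical names.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Step = Callable[[List[Row]], List[Row]]


def rename_columns(mapping: Dict[str, str]) -> Step:
    """Build a step that renames row keys according to `mapping`."""
    def step(rows: List[Row]) -> List[Row]:
        return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]
    return step


def require_columns(columns: List[str]) -> Step:
    """Build a step that keeps only rows with a non-empty value in every column."""
    def step(rows: List[Row]) -> List[Row]:
        return [row for row in rows if all(row.get(column) for column in columns)]
    return step


def run_chain(rows: List[Row], *steps: Step) -> List[Row]:
    """Apply each step in order, feeding its output to the next one."""
    for step in steps:
        rows = step(rows)
    return rows


rows = [
    {"email": "a@example.com", "phone": "123"},
    {"email": "", "phone": "456"},
]
cleaned = run_chain(
    rows,
    rename_columns({"email": "email_address"}),
    require_columns(["email_address", "phone"]),
)
```

Because every step has the same input and output type, steps can be reordered or removed without touching the others.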

4. Entity Resolution

Entity resolution supports multiple matching strategies, configurable via MatchTypeConfig:

from datalinks.links import MatchTypeConfig, ExactMatch

entity_resolution = MatchTypeConfig(
    exact_match=ExactMatch(minVariation=0.2, minDistinct=0.3)
)
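To make the exact and geo-based matching mentioned in the feature list concrete, here is a standard-library sketch of both ideas. It is an assumption-laden illustration rather than the SDK's algorithm: it does not model the minVariation or minDistinct parameters, whose semantics are not described here, and `exact_links` and `haversine_km` are hypothetical helpers.

```python
import math
from typing import Dict, List, Tuple


def exact_links(
    left: List[Dict[str, str]], right: List[Dict[str, str]], key: str
) -> List[Tuple[int, int]]:
    """Return (left_index, right_index) pairs whose `key` values are equal
    after trimming whitespace and lowercasing."""
    index: Dict[str, List[int]] = {}
    for j, row in enumerate(right):
        index.setdefault(row[key].strip().lower(), []).append(j)
    return [
        (i, j)
        for i, row in enumerate(left)
        for j in index.get(row[key].strip().lower(), [])
    ]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))
```

A geo-based matcher in this style would link two records when `haversine_km` between their coordinates falls below a chosen threshold.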

5. Loaders

Abstract base loaders (e.g., JSONLoader) let you ingest data from custom file formats such as .json.
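The loader pattern can be sketched with the standard library's `abc` and `json` modules. The class names below (`BaseLoader`, `JsonFileLoader`) and the `load` method are hypothetical and chosen for illustration. The SDK's own loader classes may expose a different interface.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, Iterator


class BaseLoader(ABC):
    """Hypothetical base class: subclasses turn one file into a stream of records."""

    @abstractmethod
    def load(self, path: Path) -> Iterator[Dict[str, Any]]:
        ...


class JsonFileLoader(BaseLoader):
    """Reads a JSON file holding either a list of objects or a single object."""

    def load(self, path: Path) -> Iterator[Dict[str, Any]]:
        with path.open(encoding="utf-8") as handle:
            payload = json.load(handle)
        # Normalise a single top-level object into a one-element list.
        records = payload if isinstance(payload, list) else [payload]
        for record in records:
            yield record
```

Supporting another format in this sketch means adding one subclass that implements `load`, while the code that consumes records stays unchanged.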


Run Unit Tests

Run tests to verify your implementation:

tox

License

DataLinks Python SDK is licensed under the MIT License. See the LICENSE file for more details.


Support

For questions or support, contact us at info@datasetlinks.com.

Built distribution: datalinks-0.0.8-py3-none-any.whl (13.1 kB)
