datachain

Wrangle unstructured AI data at scale

Maintainers (verified by PyPI): 0x2b3bfa0, dmpetrov, shcheklein

Project description

DataChain is a Python-based AI-data warehouse for transforming and analyzing unstructured data like images, audio, videos, text and PDFs. It integrates with external storage (e.g. S3) to process data efficiently without data duplication and manages metadata in an internal database for easy and efficient querying.

Use Cases

ETL. A Pythonic framework for describing and running unstructured data transformations and enrichments, and for applying models, including LLMs, to data.
Analytics. A DataChain dataset is a table that combines all the information about data objects in one place. It provides a dataframe-like API and a vectorized engine for running analytics on these tables at scale.
Versioning. DataChain versions datasets without storing, moving, or copying the underlying data. The ideal use case is a bucket with thousands or millions of images, videos, audio files, or PDFs.
Incremental Processing. DataChain’s delta and retry features allow for efficient processing workflows:
- Delta Processing: Process only new or changed files/records
- Retry Processing: Automatically reprocess records with errors or missing results
- Combined Approach: Process new data and fix errors in a single pipeline
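
The selection rule behind delta and retry processing can be illustrated in plain Python. This is a conceptual sketch only, not DataChain's implementation: the `Record` shape, the `version` field, and the `select_for_processing` helper are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Result stored for one file by a previous run (hypothetical shape)."""
    path: str
    version: str     # any change marker, e.g. an etag (assumption)
    error: str = ""  # non-empty means the last attempt failed


def select_for_processing(current: dict[str, str], previous: list[Record]) -> list[str]:
    """Return the paths a combined delta + retry run would (re)process.

    `current` maps path -> version for the files present in storage now.
    """
    seen = {r.path: r for r in previous}
    todo = []
    for path, version in sorted(current.items()):
        old = seen.get(path)
        if old is None:               # delta: new file
            todo.append(path)
        elif old.version != version:  # delta: changed file
            todo.append(path)
        elif old.error:               # retry: previous attempt failed
            todo.append(path)
    return todo


previous = [
    Record("a.txt", "v1"),
    Record("b.txt", "v1", error="timeout"),
    Record("c.txt", "v1"),
]
current = {"a.txt": "v1", "b.txt": "v1", "c.txt": "v2", "d.txt": "v1"}
print(select_for_processing(current, previous))  # ['b.txt', 'c.txt', 'd.txt']
```

Here `a.txt` is skipped because it is unchanged and succeeded earlier, while the failed, changed, and new files are all picked up in one pass.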

Getting Started

Visit Quick Start and Docs to get started with DataChain and learn more.

pip install datachain

Example: Download Subset of Files Based on Metadata

Sometimes users only need to download a specific subset of files from cloud storage, rather than the entire dataset. For example, you could use a JSON file’s metadata to download just cat images with high confidence scores.

import datachain as dc

meta = dc.read_json("gs://datachain-demo/dogs-and-cats/*json", column="meta", anon=True)
images = dc.read_storage("gs://datachain-demo/dogs-and-cats/*jpg", anon=True)

images_id = images.map(id=lambda file: file.path.split('.')[-2])
annotated = images_id.merge(meta, on="id", right_on="meta.id")

likely_cats = annotated.filter(
    (dc.Column("meta.inference.confidence") > 0.93)
    & (dc.Column("meta.inference.class_") == "cat")
)
likely_cats.to_storage("high-confidence-cats/", signal="file")
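
To make the join and filter logic above easier to follow, here is a plain-Python sketch of the same steps on a few in-memory records. The file paths and metadata values are made up for illustration, and the sketch simply drops images that have no metadata; that simplification is an assumption here, not a statement about how `merge` treats unmatched rows.

```python
def extract_id(path: str) -> str:
    """Same expression as the lambda above: second-to-last dot-separated part."""
    return path.split(".")[-2]


# Hypothetical file paths and metadata records, for illustration only.
image_paths = [
    "dogs-and-cats/cat.1.jpg",
    "dogs-and-cats/cat.2.jpg",
    "dogs-and-cats/dog.3.jpg",
]
meta = [
    {"id": "1", "inference": {"class_": "cat", "confidence": 0.97}},
    {"id": "2", "inference": {"class_": "cat", "confidence": 0.80}},
    {"id": "3", "inference": {"class_": "dog", "confidence": 0.99}},
]

# Pair each image with its metadata record via the extracted id.
meta_by_id = {m["id"]: m for m in meta}
annotated = [
    (path, meta_by_id[extract_id(path)])
    for path in image_paths
    if extract_id(path) in meta_by_id
]

# Keep only confident cat predictions, mirroring the filter expression.
likely_cats = [
    path
    for path, m in annotated
    if m["inference"]["confidence"] > 0.93 and m["inference"]["class_"] == "cat"
]
print(likely_cats)  # ['dogs-and-cats/cat.1.jpg']
```

Note that `extract_id` assumes every path contains at least one dot; a path without one would raise an `IndexError`.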

Example: Incremental Processing with Error Handling

This example shows how to use both delta and retry processing for efficient handling of large datasets that evolve over time and may occasionally have processing errors.

import datachain as dc

def process_file(file: dc.File) -> tuple[str, str, str]:
    """Analyze a file, may occasionally fail."""
    try:
        # Your processing logic here
        content = file.read_text()
        result = content.upper()
        return content, result, ""  # No error
    except Exception as e:
        # Return an error that will trigger reprocessing next time
        return "", "", str(e)  # Error field will trigger retry

# Process files efficiently with delta and retry
# Run it many times, keep adding files, to see delta and retry in action
chain = (
    dc.read_storage(
        "data/",
        update=True,
        delta=True,              # Process only new/changed files
        delta_on="file.path",    # Identify files by path
        delta_retry="error",     # Process files with error again
    )
    .map(process_file, output=("content", "result", "error"))
    .save("processed-data")
)
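
The retry behaviour relies on the convention that `process_file` returns an empty error string on success and a non-empty one on failure. That contract can be checked without any storage by using a stand-in object. The `FakeFile` class below is hypothetical and only mimics the `read_text()` method used above; the function body is copied from the example, minus the `dc.File` annotation, so the sketch runs without DataChain installed.

```python
class FakeFile:
    """Hypothetical stand-in that exposes only read_text()."""

    def __init__(self, text=None):
        self._text = text

    def read_text(self) -> str:
        if self._text is None:
            raise OSError("cannot read file")
        return self._text


def process_file(file) -> tuple[str, str, str]:
    """Same logic as the example above: (content, result, error)."""
    try:
        content = file.read_text()
        result = content.upper()
        return content, result, ""  # empty error: nothing to retry
    except Exception as e:
        return "", "", str(e)       # non-empty error: candidate for retry


ok = process_file(FakeFile("hello"))
failed = process_file(FakeFile())
print(ok)      # ('hello', 'HELLO', '')
print(failed)  # ('', '', 'cannot read file')
```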

Example: LLM-based text-file evaluation

In this example, we evaluate chatbot conversations stored in text files using LLM-based evaluation.

$ pip install mistralai # Requires version >=1.0.0
$ export MISTRAL_API_KEY=_your_key_

Python code:

import os
from mistralai import Mistral
import datachain as dc

PROMPT = "Was this dialog successful? Answer in a single word: Success or Failure."

def eval_dialogue(file: dc.File) -> bool:
    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
    response = client.chat.complete(
        model="open-mixtral-8x22b",
        messages=[{"role": "system", "content": PROMPT},
                  {"role": "user", "content": file.read()}])
    result = response.choices[0].message.content
    return result.lower().startswith("success")

chain = (
    dc.read_storage("gs://datachain-demo/chatbot-KiT/", column="file", anon=True)
    .settings(parallel=4, cache=True)
    .map(is_success=eval_dialogue)
    .save("mistral_files")
)

successful_chain = chain.filter(dc.Column("is_success") == True)
successful_chain.to_storage("./output_mistral")

print(f"{successful_chain.count()} files were exported")

With the instruction above, the Mistral model considers 31 of the 50 files to contain successful dialogues:

$ ls output_mistral/datachain-demo/chatbot-KiT/
1.txt  15.txt 18.txt 2.txt  22.txt 25.txt 28.txt 33.txt 37.txt 4.txt  41.txt ...
$ ls output_mistral/datachain-demo/chatbot-KiT/ | wc -l
31
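
The count depends on how the model's reply is turned into a boolean. The check `result.lower().startswith("success")` would miss a reply that begins with whitespace, quotes, or markup. Whether the model ever replies that way is an assumption; the hypothetical helper `parse_verdict` below is one possible way to tolerate it, not part of DataChain or the Mistral client.

```python
import string


def parse_verdict(reply: str) -> bool:
    """Return True if the reply starts with 'success', ignoring case and
    any leading whitespace or punctuation."""
    cleaned = reply.lstrip(string.punctuation + string.whitespace).lower()
    return cleaned.startswith("success")


for reply in ["Success", "  success.", "**Success**", "Failure", ""]:
    print(repr(reply), parse_verdict(reply))
```

Inside `eval_dialogue`, the final line could then become `return parse_verdict(result)`.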

Key Features

📂 Multimodal Dataset Versioning.

Version unstructured data without moving or creating data copies, by supporting references to S3, GCP, Azure, and local file systems.
Multimodal data support: images, video, text, PDFs, JSONs, CSVs, parquet, etc.
Unite files and metadata together into persistent, versioned, columnar datasets.

🐍 Python-friendly.

Operate on Python objects and object fields: float scores, strings, matrices, LLM response objects.
Run Python code on large, terabyte-scale datasets, with built-in parallelization and memory-efficient computing; no SQL or Spark required.

🧠 Data Enrichment and Processing.

Generate metadata using local AI models and LLM APIs.
Filter, join, and group datasets by metadata. Search by vector embeddings.
High-performance vectorized operations on Python objects: sum, count, avg, etc.
Pass datasets to PyTorch and TensorFlow, or export them back into storage.

Contributing

Contributions are very welcome. To learn more, see the Contributor Guide.

Community and Support

Docs
File an issue if you encounter any problems
Discord Chat
Email
Twitter

DataChain Studio Platform

DataChain Studio is a proprietary solution for teams that offers:

Centralized dataset registry to manage data, code and dependencies in one place.
Data Lineage for data sources as well as derivative datasets.
UI for Multimodal Data like images, videos, and PDFs.
Scalable Compute to handle large datasets (100M+ files) and in-house AI model inference.
Access control including SSO and team-based collaboration.

Release history

0.53.0

Apr 24, 2026

0.52.0

Apr 20, 2026

0.51.1

Apr 17, 2026

0.51.0

Apr 13, 2026

0.50.2

Apr 8, 2026

0.50.1

Apr 7, 2026

This version

0.50.0

Apr 2, 2026

0.49.1

Mar 30, 2026

0.49.0

Mar 28, 2026

0.48.4

Mar 27, 2026

0.48.3

Mar 24, 2026

0.48.2

Mar 24, 2026

0.48.1

Mar 21, 2026

0.48.0

Mar 16, 2026

0.47.2

Mar 15, 2026

0.47.1

Mar 5, 2026

0.47.0

Mar 5, 2026

0.46.5

Mar 5, 2026

0.46.4

Mar 2, 2026

0.46.3

Feb 25, 2026

0.46.2

Feb 24, 2026

0.46.1

Feb 17, 2026

0.46.0

Feb 15, 2026

0.45

Feb 10, 2026

0.44.9

Jan 29, 2026

0.44.8

Jan 29, 2026

0.44.7

Jan 25, 2026

0.44.6

Jan 23, 2026

0.44.5

Jan 22, 2026

0.44.4

Jan 14, 2026

0.44.3

Jan 13, 2026

0.44.2

Jan 12, 2026

0.44.1

Jan 7, 2026

0.44.0

Dec 31, 2025

0.43.2

Dec 31, 2025

0.43.1

Dec 29, 2025

0.43.0

Dec 24, 2025

0.42.0

Dec 22, 2025

0.41.0

Dec 19, 2025

0.40.2

Dec 19, 2025

0.40.1

Dec 18, 2025

0.40.0

Dec 14, 2025

0.39.0

Dec 10, 2025

0.38.5

Dec 7, 2025

0.38.4

Dec 5, 2025

0.38.3

Dec 3, 2025

0.38.2

Nov 30, 2025

0.38.1

Nov 29, 2025

0.38.0

Nov 22, 2025

0.37.15

Nov 15, 2025

0.37.14

Nov 15, 2025

0.37.13

Nov 11, 2025

0.37.12

Nov 7, 2025

0.37.11

Nov 4, 2025

0.37.10

Nov 2, 2025

0.37.9

Oct 30, 2025

0.37.8

Oct 30, 2025

0.37.7

Oct 28, 2025

0.37.6

Oct 28, 2025

0.37.5

Oct 27, 2025

0.37.4

Oct 27, 2025

0.37.3

Oct 27, 2025

0.37.2

Oct 26, 2025

0.37.1

Oct 22, 2025

0.37.0

Oct 20, 2025

0.36.6

Oct 19, 2025

0.36.5

Oct 18, 2025

0.36.4

Oct 18, 2025

0.36.3

Oct 17, 2025

0.36.2

Oct 16, 2025

0.36.1

Oct 16, 2025

0.36.0

Oct 15, 2025

0.35.2

Oct 13, 2025

0.35.1

Oct 12, 2025

0.35.0

Oct 9, 2025

0.34.7

Oct 9, 2025

0.34.6

Oct 5, 2025

0.34.5

Oct 4, 2025

0.34.4

Oct 3, 2025

0.34.3

Oct 2, 2025

0.34.2

Oct 1, 2025

0.34.1

Oct 1, 2025

0.34.0

Sep 30, 2025

0.33.1

Sep 26, 2025

0.33.0

Sep 24, 2025

0.32.3

Sep 16, 2025

0.32.2

Sep 16, 2025

0.32.1

Sep 14, 2025

0.32.0

Sep 11, 2025

0.31.4

Sep 11, 2025

0.31.3

Sep 11, 2025

0.31.2

Sep 10, 2025

0.31.1

Sep 10, 2025

0.31.0

Sep 3, 2025

0.30.7

Sep 2, 2025

0.30.6

Aug 29, 2025

0.30.5

Aug 29, 2025

0.30.4 yanked

Aug 27, 2025

Reason this release was yanked:

DataChain 0.30.4 was released out of the stable release window; the release has been postponed

0.30.3

Aug 21, 2025

0.30.2

Aug 16, 2025

0.30.1

Aug 13, 2025

0.30.0

Aug 12, 2025

0.29.1

Aug 11, 2025

0.29.0

Aug 11, 2025

0.28.2

Aug 6, 2025

0.28.1

Jul 30, 2025

0.28.0

Jul 28, 2025

0.27.0

Jul 24, 2025

0.26.4

Jul 17, 2025

0.26.3

Jul 15, 2025

0.26.2

Jul 15, 2025

0.26.1

Jul 15, 2025

0.26.0

Jul 12, 2025

0.25.2

Jul 10, 2025

0.25.1

Jul 10, 2025

0.25.0

Jul 9, 2025

0.24.6

Jul 9, 2025

0.24.5

Jul 8, 2025

0.24.4

Jul 5, 2025

0.24.3

Jul 3, 2025

0.24.2

Jul 2, 2025

0.24.1

Jun 30, 2025

0.24.0

Jun 29, 2025

0.23.0

Jun 28, 2025

0.22.0

Jun 26, 2025

0.21.1

Jun 25, 2025

0.21.0

Jun 25, 2025

0.20.4 yanked

Jun 24, 2025

Reason this release was yanked:

accidental release of experimental features

0.20.3 yanked

Jun 24, 2025

Reason this release was yanked:

accidental release of experimental features

0.20.2 yanked

Jun 20, 2025

Reason this release was yanked:

accidental release of experimental features

0.20.1 yanked

Jun 20, 2025

Reason this release was yanked:

accidental release of experimental features

0.20.0 yanked

Jun 19, 2025

Reason this release was yanked:

accidental release of experimental features

0.19.3

Jun 24, 2025

0.19.2

Jun 11, 2025

0.19.1

Jun 10, 2025

0.19

Jun 9, 2025

0.18.11

Jun 5, 2025

0.18.10

Jun 4, 2025

0.18.9

Jun 3, 2025

0.18.8

Jun 3, 2025

0.18.7

Jun 2, 2025

0.18.6

May 28, 2025

0.18.5

May 28, 2025

0.18.4

May 22, 2025

0.18.3

May 21, 2025

0.18.2

May 21, 2025

0.18.1

May 16, 2025

0.18.0

May 15, 2025

0.17.2

May 11, 2025

0.17.1

May 10, 2025

0.17.0

May 9, 2025

0.16.5

May 8, 2025

0.16.4

May 1, 2025

0.16.3

Apr 28, 2025

0.16.2

Apr 22, 2025

0.16.1

Apr 21, 2025

0.16.0

Apr 18, 2025

0.15.0

Apr 18, 2025

0.14.5

Apr 8, 2025

0.14.4

Apr 1, 2025

0.14.3

Mar 31, 2025

0.14.2

Mar 29, 2025

0.14.1

Mar 27, 2025

0.14.0

Mar 26, 2025

0.13.1

Mar 24, 2025

0.13.0

Mar 19, 2025

0.12.0

Mar 17, 2025

0.11.11

Mar 6, 2025

0.11.0

Feb 27, 2025

0.10.0

Feb 20, 2025

0.9.1

Feb 15, 2025

0.9.0

Feb 14, 2025

0.8.13

Feb 3, 2025

0.8.12

Jan 31, 2025

0.8.11

Jan 28, 2025

0.8.10

Jan 20, 2025

0.8.9

Jan 16, 2025

0.8.8

Jan 13, 2025

0.8.7

Jan 12, 2025

0.8.6

Jan 12, 2025

0.8.5

Jan 9, 2025

0.8.4

Jan 6, 2025

0.8.3

Dec 29, 2024

0.8.2

Dec 27, 2024

0.8.1

Dec 26, 2024

0.8.0

Dec 22, 2024

0.7.11

Dec 12, 2024

0.7.10

Dec 9, 2024

0.7.9

Dec 6, 2024

0.7.8

Dec 3, 2024

0.7.7

Dec 2, 2024

0.7.6

Nov 29, 2024

0.7.5

Nov 29, 2024

0.7.4

Nov 29, 2024

0.7.3

Nov 27, 2024

0.7.2

Nov 27, 2024

0.7.1

Nov 22, 2024

0.7.0

Nov 20, 2024

0.6.11

Nov 19, 2024

0.6.10

Nov 17, 2024

0.6.9

Nov 13, 2024

0.6.8

Nov 7, 2024

0.6.7

Nov 6, 2024

0.6.6

Nov 6, 2024

0.6.5

Nov 1, 2024

0.6.4

Oct 31, 2024

0.6.3

Oct 30, 2024

0.6.2

Oct 28, 2024

0.6.1

Oct 16, 2024

0.6.0

Oct 14, 2024

0.5.1

Oct 7, 2024

0.5.0

Sep 26, 2024

0.4.0

Sep 24, 2024

0.3.20

Sep 23, 2024

0.3.19

Sep 23, 2024

0.3.18

Sep 18, 2024

0.3.17

Sep 17, 2024

0.3.16

Sep 16, 2024

0.3.15

Sep 16, 2024

0.3.14

Sep 12, 2024

0.3.13

Sep 11, 2024

0.3.12

Sep 11, 2024

0.3.11

Sep 9, 2024

0.3.10

Sep 5, 2024

0.3.9

Aug 28, 2024

0.3.8

Aug 27, 2024

0.3.7

Aug 22, 2024

0.3.6

Aug 22, 2024

0.3.5

Aug 21, 2024

0.3.4

Aug 19, 2024

0.3.3

Aug 18, 2024

0.3.2

Aug 15, 2024

0.3.1

Aug 8, 2024

0.3.0

Aug 7, 2024

0.2.18

Aug 6, 2024

0.2.17

Aug 6, 2024

0.2.16

Aug 2, 2024

0.2.15

Jul 31, 2024

0.2.14

Jul 29, 2024

0.2.13

Jul 25, 2024

0.2.12

Jul 23, 2024

0.2.11

Jul 18, 2024

0.2.10

Jul 17, 2024

0.2.9

Jul 15, 2024

0.2.8

Jul 15, 2024

0.2.7

Jul 15, 2024

0.2.6

Jul 12, 2024

0.2.5

Jul 11, 2024

0.2.4

Jul 10, 2024

0.2.3

Jul 10, 2024

0.2.2

Jul 10, 2024

0.2.1

Jul 8, 2024

0.2.0

Jul 5, 2024

0.1.13

Jun 28, 2024

0.1.12

Jun 27, 2024

0.1.11

Jun 27, 2024

0.1.10

Jun 26, 2024

Download files

Download the file for your platform.

Source Distribution

datachain-0.50.0.tar.gz (3.4 MB)

Uploaded Apr 2, 2026 Source

Built Distribution

datachain-0.50.0-py3-none-any.whl (394.4 kB)

Uploaded Apr 2, 2026 Python 3

File details

Details for the file datachain-0.50.0.tar.gz.

File metadata

Download URL: datachain-0.50.0.tar.gz
Upload date: Apr 2, 2026
Size: 3.4 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for datachain-0.50.0.tar.gz
Algorithm	Hash digest
SHA256	`2eba91821f6e35aae2477d6977b91023e408919ba3acb0b21c6b84090c4cdf11`
MD5	`5a6cdb1aea22ff579da33c4a037ce360`
BLAKE2b-256	`3b8f1da14bcd087fb4c214efd5a197ca65404141610bc67f5df6339a2df7045c`
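
A downloaded archive can be checked against the SHA256 digest listed above using Python's standard `hashlib` module. The sketch below is one minimal way to do it; the helper names `sha256_of_stream` and `verify_file`, and the local file path in the usage note, are assumptions for illustration.

```python
import hashlib
import io

# SHA256 digest published above for datachain-0.50.0.tar.gz.
EXPECTED_SHA256 = "2eba91821f6e35aae2477d6977b91023e408919ba3acb0b21c6b84090c4cdf11"


def sha256_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: str, expected_hex: str) -> bool:
    """Return True if the file at `path` has the expected SHA256 digest."""
    with open(path, "rb") as fh:
        return sha256_of_stream(fh) == expected_hex.lower()


# Self-check on in-memory bytes, so the sketch runs without a download.
print(sha256_of_stream(io.BytesIO(b"abc")))
```

With the archive saved in the working directory (an assumed location), `verify_file("datachain-0.50.0.tar.gz", EXPECTED_SHA256)` should return `True` for an intact download.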

Provenance

The following attestation bundles were made for datachain-0.50.0.tar.gz:

Publisher: release.yml on datachain-ai/datachain

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: datachain-0.50.0.tar.gz
- Subject digest: 2eba91821f6e35aae2477d6977b91023e408919ba3acb0b21c6b84090c4cdf11
- Sigstore transparency entry: 1214459921
- Sigstore integration time: Apr 2, 2026
Source repository:
- Permalink: datachain-ai/datachain@a7babb2e7d0e5f89de011d5114e9898e578d36ea
- Branch / Tag: refs/tags/0.50.0
- Owner: https://github.com/datachain-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a7babb2e7d0e5f89de011d5114e9898e578d36ea
- Trigger Event: release

File details

Details for the file datachain-0.50.0-py3-none-any.whl.

File metadata

Download URL: datachain-0.50.0-py3-none-any.whl
Upload date: Apr 2, 2026
Size: 394.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for datachain-0.50.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ee0032bf093d0231ed8ac9c6d09c449d27a69f52422c02460d5d89bfdfa7fa9e`
MD5	`e0a861b632906dc6fdf0c098f6b6e813`
BLAKE2b-256	`33b9ecdd232ed69c463b4377687568eac1433e5b239a2466208990f0adf07c88`

Provenance

The following attestation bundles were made for datachain-0.50.0-py3-none-any.whl:

Publisher: release.yml on datachain-ai/datachain

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: datachain-0.50.0-py3-none-any.whl
- Subject digest: ee0032bf093d0231ed8ac9c6d09c449d27a69f52422c02460d5d89bfdfa7fa9e
- Sigstore transparency entry: 1214459968
- Sigstore integration time: Apr 2, 2026
Source repository:
- Permalink: datachain-ai/datachain@a7babb2e7d0e5f89de011d5114e9898e578d36ea
- Branch / Tag: refs/tags/0.50.0
- Owner: https://github.com/datachain-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a7babb2e7d0e5f89de011d5114e9898e578d36ea
- Trigger Event: release
