Haystack custom components for your favourite dataframe library.

Dataframes Haystack

📃 Description

dataframes-haystack is an extension for Haystack 2 that enables integration with dataframe libraries.

The dataframe libraries currently supported are pandas and polars.

The library offers several custom converter components that read files into dataframes and transform dataframes into Haystack Document objects:

  • DataFrameFileToDocument is the main generic converter: it reads files using a dataframe backend and converts them into Document objects.
  • FileToPandasDataFrame and FileToPolarsDataFrame read files and load them into dataframes.
  • PandasDataFrameConverter and PolarsDataFrameConverter convert data stored in dataframes into Haystack Document objects.

dataframes-haystack supports reading files in various formats:

  • csv, json, parquet, excel, html, xml, orc, pickle, fixed-width format for pandas. See the pandas documentation for more details.
  • csv, json, parquet, excel, avro, delta, ipc for polars. See the polars documentation for more details.
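One way to picture the format support is as a lookup from a format name to the backend's reader function. The sketch below is only an illustration: the `READERS` table, the `reader_name` helper, and the short key `"fwf"` for fixed-width files are assumptions made here, not the library's internals. The reader names themselves (`read_csv`, `read_fwf`, `read_delta`, and so on) are the public pandas and polars functions.

```python
# Hypothetical lookup table: format name -> reader function name per backend.
# This is an illustrative sketch, not how dataframes-haystack dispatches internally.
READERS = {
    "pandas": {
        "csv": "read_csv",
        "json": "read_json",
        "parquet": "read_parquet",
        "excel": "read_excel",
        "html": "read_html",
        "xml": "read_xml",
        "orc": "read_orc",
        "pickle": "read_pickle",
        "fwf": "read_fwf",  # fixed-width format; the "fwf" key is an assumption
    },
    "polars": {
        "csv": "read_csv",
        "json": "read_json",
        "parquet": "read_parquet",
        "excel": "read_excel",
        "avro": "read_avro",
        "delta": "read_delta",
        "ipc": "read_ipc",
    },
}


def reader_name(backend: str, file_format: str) -> str:
    """Return the name of the reader function for a backend/format pair."""
    try:
        return READERS[backend][file_format]
    except KeyError:
        raise ValueError(
            f"Unsupported combination: backend={backend!r}, format={file_format!r}"
        ) from None


print(reader_name("polars", "avro"))
```

With the backend installed, the reader itself could then be resolved with `getattr(pandas, reader_name("pandas", "csv"))`.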

🛠️ Installation

# for pandas
pip install "dataframes-haystack[pandas]"

# for polars
pip install "dataframes-haystack[polars]"
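Because each backend is an optional extra, it can help to check which one is importable before choosing a converter. This is a minimal sketch using only the standard library; the `available_backends` helper is a hypothetical name introduced here and is not part of dataframes-haystack.

```python
from importlib.util import find_spec


def available_backends() -> dict[str, bool]:
    """Report which optional dataframe backends can be imported."""
    # find_spec returns None when a top-level package is not installed.
    return {name: find_spec(name) is not None for name in ("pandas", "polars")}


if __name__ == "__main__":
    for name, installed in available_backends().items():
        print(f"{name}: {'installed' if installed else 'missing'}")
```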

💻 Usage

[!TIP] See the Example Notebooks for complete examples.

DataFrameFileToDocument

Complete example

You can leverage both pandas and polars backends (thanks to narwhals) to read your data!

from dataframes_haystack.components.converters import DataFrameFileToDocument

converter = DataFrameFileToDocument(content_column="text_str")
documents = converter.run(files=["file1.csv", "file2.csv"])

Result:

>>> documents
{'documents': [
    Document(id=0, content: 'Hello world', meta: {}),
    Document(id=1, content: 'Hello everyone', meta: {})
]}
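The example above does not show the input files. To try it locally, files like the following would be enough, assuming each CSV holds a single `text_str` column with one row. That layout is a guess based on the output shown, and the `write_sample_csv` helper is hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def write_sample_csv(path: Path, texts: list[str], column: str = "text_str") -> Path:
    """Write a one-column CSV file with a header row and one row per text."""
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=[column])
        writer.writeheader()
        writer.writerows({column: text} for text in texts)
    return path


# Assumed contents: one row per file, matching the two documents shown above.
workdir = Path(tempfile.mkdtemp())
file1 = write_sample_csv(workdir / "file1.csv", ["Hello world"])
file2 = write_sample_csv(workdir / "file2.csv", ["Hello everyone"])
print(file1, file2)
```

The two resulting paths can then be passed to the converter in place of `"file1.csv"` and `"file2.csv"`.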

pandas Converters

Complete example

FileToPandasDataFrame

from dataframes_haystack.components.converters.pandas import FileToPandasDataFrame

converter = FileToPandasDataFrame(file_format="csv")

output_dataframe = converter.run(
    file_paths=["data/doc1.csv", "data/doc2.csv"]
)

Result:

>>> output_dataframe
{'dataframe': <pandas.DataFrame>}

PandasDataFrameConverter

import pandas as pd

from dataframes_haystack.components.converters.pandas import PandasDataFrameConverter

df = pd.DataFrame({
    "text": ["Hello world", "Hello everyone"],
    "filename": ["doc1.txt", "doc2.txt"],
})

converter = PandasDataFrameConverter(content_column="text", meta_columns=["filename"])
documents = converter.run(df)

Result:

>>> documents
{'documents': [
    Document(id=0, content: 'Hello world', meta: {'filename': 'doc1.txt'}),
    Document(id=1, content: 'Hello everyone', meta: {'filename': 'doc2.txt'})
]}
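Conceptually, `content_column` selects the value that becomes the document text and `meta_columns` selects the values copied into the metadata. The sketch below mirrors that mapping with plain dictionaries and a stand-in `SimpleDocument` dataclass. Both `SimpleDocument` and `rows_to_documents` are hypothetical names used for illustration, not the Haystack `Document` class or the converter's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleDocument:
    """Stand-in for a document: text content plus a metadata dictionary."""

    content: str
    meta: dict[str, Any] = field(default_factory=dict)


def rows_to_documents(rows, content_column, meta_columns=()):
    """Build one SimpleDocument per row from the selected columns."""
    documents = []
    for row in rows:
        # Copy only the requested metadata columns; other columns are ignored.
        meta = {column: row[column] for column in meta_columns}
        documents.append(SimpleDocument(content=str(row[content_column]), meta=meta))
    return documents


rows = [
    {"text": "Hello world", "filename": "doc1.txt"},
    {"text": "Hello everyone", "filename": "doc2.txt"},
]
documents = rows_to_documents(rows, content_column="text", meta_columns=["filename"])
print(documents)
```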

polars Converters

Complete example

FileToPolarsDataFrame

from dataframes_haystack.components.converters.polars import FileToPolarsDataFrame

converter = FileToPolarsDataFrame(file_format="csv")

output_dataframe = converter.run(
    file_paths=["data/doc1.csv", "data/doc2.csv"]
)

Result:

>>> output_dataframe
{'dataframe': <polars.DataFrame>}

PolarsDataFrameConverter

import polars as pl

from dataframes_haystack.components.converters.polars import PolarsDataFrameConverter

df = pl.DataFrame({
    "text": ["Hello world", "Hello everyone"],
    "filename": ["doc1.txt", "doc2.txt"],
})

converter = PolarsDataFrameConverter(content_column="text", meta_columns=["filename"])
documents = converter.run(df)

Result:

>>> documents
{'documents': [
    Document(id=0, content: 'Hello world', meta: {'filename': 'doc1.txt'}),
    Document(id=1, content: 'Hello everyone', meta: {'filename': 'doc2.txt'})
]}

🤝 Contributing

Do you have an idea for a new feature? Did you find a bug that needs fixing?

Feel free to open an issue or submit a PR!

Setup development environment

Requirements: hatch, pre-commit

  1. Clone the repository
  2. Run hatch shell to create and activate a virtual environment
  3. Run pre-commit install to install the pre-commit hooks. The hooks run the linting and formatting checks on every commit.

Run tests

  • Linting and formatting checks: hatch run lint:fmt
  • Unit tests: hatch run test-cov-all

✍️ License

dataframes-haystack is distributed under the terms of the MIT license.

Download files

Source Distribution

dataframes_haystack-0.0.5.tar.gz (163.6 kB)

Uploaded Source

Built Distribution

dataframes_haystack-0.0.5-py3-none-any.whl (12.9 kB)

Uploaded Python 3

File details

Details for the file dataframes_haystack-0.0.5.tar.gz.

File metadata

  • Download URL: dataframes_haystack-0.0.5.tar.gz
  • Upload date:
  • Size: 163.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for dataframes_haystack-0.0.5.tar.gz
Algorithm Hash digest
SHA256 960299cd4d247a7deda6fbb75f800178acc7c7ae5f757f01f6426e333496034d
MD5 671dde896a763ef7c90db26391ec230f
BLAKE2b-256 fe4ea80b75b5771b826433a3a62e2c0c44a9e54d5ed3ea3fc42c56b50fcf213b

Provenance

The following attestation bundles were made for dataframes_haystack-0.0.5.tar.gz:

Publisher: publish.yml on EdAbati/dataframes-haystack

File details

Details for the file dataframes_haystack-0.0.5-py3-none-any.whl.

File hashes

Hashes for dataframes_haystack-0.0.5-py3-none-any.whl
Algorithm Hash digest
SHA256 ff8e1e62f2e8a52aa273a43d48186f5b26660650aa35d8247b27225c22a78ee9
MD5 fe1acab12a2fbd9cd304905e942529d7
BLAKE2b-256 2defcf75a24d226bdfe79b8172bea67963355453713041a48cbd132eb54ce1a6

Provenance

The following attestation bundles were made for dataframes_haystack-0.0.5-py3-none-any.whl:

Publisher: publish.yml on EdAbati/dataframes-haystack
