A library/cli that allows you to vectorize your data, enabling you to create RAG powered applications.

Overview

docs2vecs is a library and CLI that vectorizes your data so that you can build RAG-powered applications.

For these applications, docs2vecs simplifies the entire process:

Data ingestion: Use the indexer to run the data ingestion pipeline: data retrieval, chunking, embedding, and storing resulting vectors in a Vector DB.
Build proofs of concept: docs2vecs lets you quickly create a RAG prototype by using a local ChromaDB as the vector store and a server mode to chat with your data.
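The ingestion steps listed above (retrieval, chunking, embedding, storing) can be pictured with a small standalone sketch. This is not the docs2vecs implementation: `chunk_text`, `toy_embed`, and `run_pipeline` are hypothetical names, the embedding is a placeholder rather than a real model, and a Python list stands in for the vector DB.

```python
from typing import Dict, List


def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def toy_embed(chunk: str) -> List[float]:
    """Placeholder embedding (assumption): chunk length and vowel count."""
    vowels = sum(chunk.lower().count(v) for v in "aeiou")
    return [float(len(chunk)), float(vowels)]


def run_pipeline(documents: Dict[str, str]) -> List[dict]:
    """Chunk every document, embed each chunk, and collect the records."""
    store: List[dict] = []  # stands in for a vector DB
    for doc_id, text in documents.items():
        for n, chunk in enumerate(chunk_text(text)):
            store.append({"id": f"{doc_id}-{n}", "text": chunk, "vector": toy_embed(chunk)})
    return store


records = run_pipeline({"doc": "hello world"})
```

In a real pipeline the placeholder embedding would be replaced by an embedding model and the list by a vector store such as ChromaDB.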

The docs2vecs project is managed with uv.

Usage

You can use docs2vecs in three ways:

Install from PyPI
Install locally from source
Run from Docker/Podman image.

Install from PyPI

You can install docs2vecs from PyPI using pip:

pip install docs2vecs

To install all the extra dependencies as well, use:

pip install "docs2vecs[all]"

Run locally from source

gh repo clone AmadeusITGroup/docs2vecs
cd docs2vecs
uv run --directory src docs2vecs --help

Run from Docker image

export OCI_ENGINE=podman # or docker
export DOCS2VECS_VERSION=latest # or a specific version
${OCI_ENGINE} run -it --rm \
    ghcr.io/amadeusitgroup/docs2vecs:${DOCS2VECS_VERSION} \
    --help # or any other valid command that can be run with docs2vecs

Documentation

Indexer sub-command

The indexer sub-command runs an indexer pipeline configured in a config file. This is usually used when you have a lot of data to vectorize and want to run it in a batch.

uv run --directory src docs2vecs indexer --help

usage: docs2vecs indexer [-h] --config CONFIG [--env ENV]
options:
--config CONFIG  Path to the YAML configuration file.
--env ENV        Environment file to load.

The indexer takes two arguments: a mandatory config file and an optional environment file.

In the config file you'll need to define a list of skills, a skillset, and an indexer. Note that you may define plenty of skills, but only those enumerated in the skillset will be executed in sequence.
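The relationship between defined skills and the skillset can be illustrated with a toy example. This is only a sketch of the idea, not the docs2vecs config format or its internals: the `SKILLS` registry, the skill names, and `run_skillset` are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical registry: every defined skill maps a name to a transformation.
SKILLS: Dict[str, Callable[[List[str]], List[str]]] = {
    "strip": lambda docs: [d.strip() for d in docs],
    "lowercase": lambda docs: [d.lower() for d in docs],
    "drop_empty": lambda docs: [d for d in docs if d],
    "reverse": lambda docs: [d[::-1] for d in docs],  # defined, but not in the skillset below
}


def run_skillset(skillset: List[str], docs: List[str]) -> List[str]:
    """Run only the skills named in the skillset, in the order they are listed."""
    for name in skillset:
        if name not in SKILLS:
            raise KeyError(f"skillset references undefined skill: {name}")
        docs = SKILLS[name](docs)
    return docs


# "reverse" is defined above but never executed because the skillset omits it.
result = run_skillset(["strip", "drop_empty", "lowercase"], ["  Hello ", "   ", "World"])
print(result)
```

The output of each skill becomes the input of the next, which is why the order of the skillset matters.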

Example:

uv run --directory src docs2vecs indexer --config ~/Downloads/sw_export_temp/config/confluence_process.yml --env ~/indexer.env

Please check the detailed skills documentation.

The config yaml file is validated against this schema.

Please check this sample config file.

Server sub-command

If you have already indexed your data (see the previous section) and stored the resulting embeddings in a local ChromaDB, you can chat with your data using the server sub-command.

uv run --directory src docs2vecs server --help

usage: docs2vecs server [-h] [--host HOST] [--port PORT] [--model MODEL] [--cache_dir CACHE_DIR] [--path PATH]
                        [--workers WORKERS] [--log_level LOG_LEVEL] [--env ENV]

options:
  -h, --help            show this help message and exit
  --host HOST           A host for the server.
  --port PORT           A port for the server.
  --model MODEL         A name of the embedding model(as per huggingface coordinates).
  --cache_dir CACHE_DIR
                        A path to the cache directory.
  --path PATH           A path for the server.
  --workers WORKERS     Number of workers for the server.
  --log_level LOG_LEVEL
                        Log level for the server.
  --env ENV             Environment file to load.

By default, the host is localhost and the port is 8008.

Example:

uv run --directory src docs2vecs server --path path/to/where/your/chroma/db/is

Then open http://localhost:8008/ in your browser. You should see the embedding collections stored in your vector store and be able to run a k-nearest-neighbours (k-NN) search based on a user query. You can change K, the number of nearest neighbours returned by the semantic search.
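For readers unfamiliar with k-NN search over embeddings, the idea can be sketched in a few lines of plain Python. The sketch assumes cosine similarity as the distance measure and uses hand-written two-dimensional vectors; the metric actually used by the server and by ChromaDB depends on their configuration, and `knn_search` is a hypothetical helper, not part of docs2vecs.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_search(query: List[float], collection: Dict[str, List[float]], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k entries of the collection most similar to the query vector."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in collection.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy collection of pre-computed embeddings (illustrative values only).
collection = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
top = knn_search([1.0, 0.0], collection, k=2)
print([doc_id for doc_id, _ in top])
```

Raising `k` returns more, progressively less similar, neighbours; a real vector store replaces the linear scan with an approximate index.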

Integrated Vectorization sub-command

integrated_vec - Run an integrated vectorization pipeline configured in a config file.

uv run --directory src docs2vecs integrated_vec --help

usage: docs2vecs integrated_vec [-h] --config CONFIG [--env ENV]
options:
--config CONFIG  Path to the YAML configuration file.
--env ENV        Environment file to load.

Example:

uv run --directory src docs2vecs integrated_vec --config ~/Downloads/sw_export_temp/config/config.yaml --env ~/integrated_vec.env

The config yaml file is validated against this schema.

Config yml file sample:

---
integrated_vec:
    id: AzureAISearchIndexer
    skill:
        type: integrated_vec
        name: AzureAISearchIntegratedVectorization
        params:
            search_ai_api_key: env.AZURE_AI_SEARCH_API_KEY
            search_ai_endpoint: http://replace.me.with.your.endpoint
            embedding_endpoint: http://replace.me.with.your.endpoint
            index_name: your_index_name
            indexer_name: new_indexer_name
            skillset_name: new_skillset_name
            data_source_connection_string: ResourceId=/subscriptions/your_subscription_id/resourceGroups/resource_group_name/providers/Microsoft.Storage/storageAccounts/storage_account_name;
            data_source_connection_name: new_connection_name
            encryption_key: env.AZURE_AI_SEARCH_ENCRYPTION_KEY
            container_name: your_container_name

Important note:

Please note that API keys should NOT be stored in config files and should NOT be added to git. Therefore, when you build your config file, use the env. prefix for the api_key parameter. For example: api_key: env.AZURE_AI_SEARCH_API_KEY.

Make sure you export the environment variables before you run the indexer. For convenience you can use the --env argument to supply your own .env file.
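To make the env. convention concrete, here is a minimal sketch of how such references could be resolved against environment variables after the config is loaded. It is an assumption about the general mechanism, not the actual docs2vecs code; `resolve_env_refs` is a hypothetical helper and the key value is a dummy.

```python
import os
from typing import Any

ENV_PREFIX = "env."


def resolve_env_refs(value: Any) -> Any:
    """Recursively replace strings such as 'env.NAME' with os.environ['NAME']."""
    if isinstance(value, dict):
        return {key: resolve_env_refs(item) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_env_refs(item) for item in value]
    if isinstance(value, str) and value.startswith(ENV_PREFIX):
        name = value[len(ENV_PREFIX):]
        if name not in os.environ:
            raise KeyError(f"environment variable {name} is not set")
        return os.environ[name]
    return value


# Normally the variable is exported in the shell or loaded from a .env file.
os.environ["AZURE_AI_SEARCH_API_KEY"] = "dummy-key-for-illustration"

params = {
    "search_ai_api_key": "env.AZURE_AI_SEARCH_API_KEY",
    "index_name": "your_index_name",
}
resolved = resolve_env_refs(params)
```

Failing loudly on a missing variable is a deliberate choice in this sketch: a silently empty key tends to surface later as a confusing authentication error.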

Experimental features

Tracker

The tracker feature allows you to monitor and manage the status of documents processed by the indexer. This is particularly useful for tracking failed documents and retrying their processing.

To achieve this, the tracker needs a MongoDB connection, which can be defined in the input config file.

Each document stored in MongoDB holds a chunk together with a document_id, which is the hash of that chunk's content. As long as the content stays the same, the hash stays the same. Each document also has a status property that records whether the upload to the vector store succeeded.
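The content-hash idea can be sketched as follows. The hash algorithm (SHA-256 here), the status values, and the in-memory dictionary standing in for MongoDB are all assumptions made for illustration; the source does not specify them.

```python
import hashlib
from typing import Dict, List

# In-memory stand-in for the MongoDB collection (assumption).
tracker: Dict[str, dict] = {}


def chunk_id(content: str) -> str:
    """Derive a stable identifier from the chunk content (SHA-256 is an assumption)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def record(content: str, uploaded: bool) -> str:
    """Store or update the upload status of a chunk, keyed by its content hash."""
    doc_id = chunk_id(content)
    tracker[doc_id] = {"document_id": doc_id, "status": "success" if uploaded else "failed"}
    return doc_id


def failed_ids() -> List[str]:
    """List the chunks whose upload should be retried."""
    return [doc_id for doc_id, entry in tracker.items() if entry["status"] == "failed"]


first = record("same text", uploaded=False)   # first attempt fails
second = record("same text", uploaded=True)   # retry of identical content succeeds
```

Because the identifier depends only on the content, the retry updates the existing entry instead of creating a second one.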

If you'd like to use a different database to keep track of this, you'll have to write your own "driver", similar to the existing MongoDB one, and then add it to the DBFactory.
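The general shape of a pluggable driver is sketched below. The real driver interface and the DBFactory API are not described here, so every name in this example (`TrackerDriver`, `InMemoryDriver`, `register_driver`, `create_driver`) is hypothetical and only illustrates the pattern of an abstract interface plus a name-based registry.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class TrackerDriver(ABC):
    """Hypothetical interface a tracking backend would implement."""

    @abstractmethod
    def upsert(self, document_id: str, status: str) -> None:
        """Create or update the status of a document."""

    @abstractmethod
    def find_by_status(self, status: str) -> List[str]:
        """Return the ids of all documents with the given status."""


class InMemoryDriver(TrackerDriver):
    """Toy backend that keeps everything in a dictionary."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def upsert(self, document_id: str, status: str) -> None:
        self._rows[document_id] = status

    def find_by_status(self, status: str) -> List[str]:
        return [doc_id for doc_id, current in self._rows.items() if current == status]


# Hypothetical registry playing the role of a factory.
DRIVERS: Dict[str, Type[TrackerDriver]] = {}


def register_driver(name: str, driver_cls: Type[TrackerDriver]) -> None:
    DRIVERS[name] = driver_cls


def create_driver(name: str) -> TrackerDriver:
    if name not in DRIVERS:
        raise KeyError(f"no tracker driver registered under: {name}")
    return DRIVERS[name]()


register_driver("memory", InMemoryDriver)
driver = create_driver("memory")
driver.upsert("abc", "failed")
```

Check the existing MongoDB driver in the repository for the methods a real driver must provide before adapting this pattern.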

Development

To run tests with pytest:

uv python install 3.11
uv sync --all-extras --dev
uv run pytest tests

It is also possible to use tox:

uv pip install tox
uv run tox

Note: to combine the coverage data from all the tox environments, run:

OS	Command
Windows	`set PYTEST_ADDOPTS=--cov-append`, then `tox`
Other	`PYTEST_ADDOPTS=--cov-append tox`

Releasing

To release a new version of the package, you can create a pre-release from the main branch using GitHub UI, which will then trigger the release workflow. Alternatively, you can use the gh command line tool to create a release:

gh release create v[a.b.c] --prerelease --title "Kick starting the release"  --target main

Contributing

We welcome contributions to the docs2vecs project! If you have an idea for a new feature, bug fix, or improvement, please open an issue or submit a pull request. Before contributing, please read our contributing guidelines.


