DocVec CLI

🚀 Overview
DocVec CLI is a command-line tool that converts unstructured local documents (PDF, DOCX, TXT) into vector embeddings, ready for use with Large Language Models (LLMs) in Retrieval-Augmented Generation (RAG) workflows.

✨ Key Features

Multi-Format Support: Processes .pdf, .docx, and .txt files.
Automatic Text Extraction: Efficiently extracts raw text content from various document types.
Intelligent Text Cleaning: Removes unnecessary whitespace, excessive newlines, and basic HTML tags.
Configurable Text Chunking: Uses langchain's RecursiveCharacterTextSplitter, with customizable chunk_size and chunk_overlap.
Offline Embedding Generation: Uses local sentence-transformers models (default: all-MiniLM-L6-v2) to create high-quality vector embeddings directly on your machine, ensuring privacy and offline capabilities.
ChromaDB-Compatible Output: Generates JSON files structured for easy ingestion into ChromaDB or other vector databases.
User-Friendly CLI: Simple command-line arguments for input/output paths and processing parameters.
Progress Indicators: Visual progress bars for long-running operations like embedding generation.
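
The text cleaning step listed above can be illustrated with the standard library alone. This is a minimal sketch, not the project's actual code: the function name clean_text and the exact regular expressions are assumptions chosen to match the description (whitespace, excessive newlines, basic HTML tags).

```python
import re


def clean_text(raw: str) -> str:
    """Hypothetical cleaner: strips basic HTML tags and normalizes whitespace."""
    # Replace simple HTML tags such as <p> or </div> with a space.
    # This is a regex heuristic, not a full HTML parser.
    text = re.sub(r"</?[a-zA-Z][^>]*>", " ", raw)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop a single space directly before or after a newline.
    text = re.sub(r" ?\n ?", "\n", text)
    # Reduce three or more consecutive newlines to one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


print(repr(clean_text("<p>Hello   world</p>\n\n\n\nNext  line")))
```

The order matters in this sketch: tags are removed first so that the spaces they leave behind are collapsed by the later steps.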

📦 Installation

Prerequisites

Python 3.8 or newer

Steps

1. Clone the repository:

git clone https://github.com/onurbaran/docvec-cli.git
cd docvec-cli

2. Create and activate a virtual environment:

It’s highly recommended to use a virtual environment to manage dependencies.

python -m venv .venv

# On Windows:
.\.venv\Scripts\activate

# On macOS/Linux:
source ./.venv/bin/activate

3. Install dependencies:

Ensure your requirements.txt contains:

pypdf
python-docx
sentence-transformers
langchain-text-splitters
tqdm
numpy

Then run:

pip install -r requirements.txt

🚀 Usage

Once the dependencies are installed, you can run DocVec CLI from your terminal.

Basic Command Structure

python src/main.py --input-path <path_to_document_or_directory> --output-path <path_to_output_directory> [OPTIONS]

Required Arguments

--input-path <path>: Path to a document file (e.g., report.pdf). Directory input is not processed yet; it is planned for a future update.
--output-path <path>: Path to the directory where the generated vector and metadata files will be saved.

Optional Arguments

--chunk-size <int>: Max size of each text chunk in characters (default: 1000)
--chunk-overlap <int>: Number of characters to overlap between chunks (default: 200)
--model-name <str>: Sentence-transformers model name (default: all-MiniLM-L6-v2)
--output-format <str>: Format for output files (default: json, only format currently supported)
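
The arguments above map naturally onto an argparse parser. The sketch below is an assumption about how such a parser could look, not the contents of src/main.py (which may use a different library or structure); only the flag names and defaults are taken from the lists above. The function name build_parser is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        prog="docvec-cli",
        description="Convert a local document into vector embeddings.",
    )
    # Required arguments
    parser.add_argument("--input-path", required=True,
                        help="Path to a document file (PDF, DOCX, or TXT).")
    parser.add_argument("--output-path", required=True,
                        help="Directory for the generated JSON file.")
    # Optional arguments with the documented defaults
    parser.add_argument("--chunk-size", type=int, default=1000,
                        help="Max size of each text chunk in characters.")
    parser.add_argument("--chunk-overlap", type=int, default=200,
                        help="Characters shared between neighboring chunks.")
    parser.add_argument("--model-name", default="all-MiniLM-L6-v2",
                        help="Sentence-transformers model name.")
    parser.add_argument("--output-format", default="json", choices=["json"],
                        help="Output format (only json is supported).")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--input-path", "docs/my_report.pdf", "--output-path", "vectors/"]
)
print(args.chunk_size, args.chunk_overlap, args.model_name)
```

argparse converts the dashes in flag names to underscores, so --chunk-size is read back as args.chunk_size.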

📁 Examples

Process a single PDF file:

python src/main.py --input-path "docs/my_report.pdf" --output-path "vectors/"

Process a DOCX file with custom chunking:

python src/main.py --input-path "articles/research.docx" --output-path "embeddings/" --chunk-size 500 --chunk-overlap 100
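
To show what --chunk-size 500 and --chunk-overlap 100 mean in practice, the sketch below uses a plain fixed-size sliding window. This is a deliberate simplification: the tool itself uses RecursiveCharacterTextSplitter, which also tries to split at paragraph, line, and word boundaries, so its real chunk boundaries will differ. The function name chunk_text is hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Hypothetical fixed-window chunker; each chunk repeats the tail of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    # Each new chunk starts this many characters after the previous start.
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "x" * 1200
print([len(c) for c in chunk_text(sample, chunk_size=500, chunk_overlap=100)])
# prints [500, 500, 400]
```

With a 1200-character input, windows start at offsets 0, 400, and 800, so each chunk shares its first 100 characters with the end of the previous one.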

Process a TXT file with a different embedding model:

python src/main.py --input-path "notes/daily_journal.txt" --output-path "processed_data/" --model-name "all-MiniLM-L12-v2"

📄 Output File Structure

For each processed document (e.g., my_report.pdf), a JSON file (my_report_vectors.json) will be created in the specified --output-path.

Example content:

[
  {
    "id": "my_report-0",
    "document": "This is the text content of the first chunk...",
    "embedding": [0.123, -0.456, ..., 0.789],
    "metadata": {
      "source_file": "my_report.pdf",
      "chunk_index": 0,
      "chunk_size": 250
    }
  }
]
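
Because each record carries an id, document, embedding, and metadata, the file can be reshaped into the parallel lists that vector databases commonly expect. The sketch below is an illustration under assumptions: the helper name to_chroma_columns is hypothetical, and the inline sample uses a shortened three-value embedding as a stand-in for real output (all-MiniLM-L6-v2 produces 384-dimensional vectors).

```python
import json
from typing import Any, Dict, List


def to_chroma_columns(records: List[Dict[str, Any]]) -> Dict[str, list]:
    """Hypothetical helper: turn a list of records into parallel column lists."""
    return {
        "ids": [r["id"] for r in records],
        "documents": [r["document"] for r in records],
        "embeddings": [r["embedding"] for r in records],
        "metadatas": [r["metadata"] for r in records],
    }


# In-memory stand-in for the contents of my_report_vectors.json.
# The embedding is shortened to three placeholder values for readability.
sample_json = """
[
  {
    "id": "my_report-0",
    "document": "This is the text content of the first chunk...",
    "embedding": [0.123, -0.456, 0.789],
    "metadata": {"source_file": "my_report.pdf", "chunk_index": 0, "chunk_size": 250}
  }
]
"""

records = json.loads(sample_json)
columns = to_chroma_columns(records)
print(columns["ids"])
# prints ['my_report-0']
```

With the chromadb package installed, these lists should be usable as the keyword arguments of a collection's add method, for example collection.add(**columns). To read a real output file, replace json.loads(sample_json) with json.load on the opened file.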

🤝 Contributing

We welcome contributions from the community! To contribute:

1. Fork the repository.
2. Create a new branch: git checkout -b feature/your-feature-name
3. Make your changes.
4. Write clear, concise commit messages.
5. Push your branch: git push origin feature/your-feature-name
6. Open a Pull Request.

Please ensure:

Your code follows PEP 8.
You include appropriate tests.

📄 License

This project is licensed under the MIT License.

📧 Contact

For questions, feedback, or issues, please open an issue on the GitHub repository (https://github.com/onurbaran/docvec-cli).

Project details

Version 0.1.0, released May 24, 2025.

