DocVec CLI
🚀 Overview
DocVec CLI is a command-line tool that converts unstructured local documents (PDF, DOCX, TXT) into query-ready vector embeddings for use with Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) workflows.
✨ Key Features
- Multi-Format Support: Processes .pdf, .docx, and .txt files.
- Automatic Text Extraction: Efficiently extracts raw text content from various document types.
- Intelligent Text Cleaning: Removes unnecessary whitespace, excessive newlines, and basic HTML tags.
- Configurable Text Chunking: Uses langchain's RecursiveCharacterTextSplitter, with customizable chunk_size and chunk_overlap.
- Offline Embedding Generation: Uses local sentence-transformers models (default: all-MiniLM-L6-v2) to create high-quality vector embeddings directly on your machine, ensuring privacy and offline capabilities.
- ChromaDB-Compatible Output: Generates JSON files structured for easy ingestion into ChromaDB or other vector databases.
- User-Friendly CLI: Simple command-line arguments for input/output paths and processing parameters.
- Progress Indicators: Visual progress bars for long-running operations like embedding generation.
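The cleaning step listed above can be pictured with a short sketch. This is only an illustration of the kind of normalization described (whitespace, excessive newlines, basic HTML tags); the function name clean_text and the exact regular expressions are assumptions, not the project's actual code.

```python
import re


def clean_text(raw: str) -> str:
    """Hypothetical cleaner: strip basic HTML tags and normalize whitespace."""
    # Replace simple HTML tags such as <p> or </div> with a space.
    text = re.sub(r"<[^>]+>", " ", raw)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop a stray space directly before or after a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Reduce three or more consecutive newlines to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    print(clean_text("<p>Hello   world</p>\n\n\n\n<b>Next</b>  line"))
```

A regex-based approach like this only handles simple tags; heavily marked-up input would need a real HTML parser.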
📦 Installation
Prerequisites
- Python 3.8 or newer
Steps
1. Clone the repository:
git clone https://github.com/onurbaran/docvec-cli.git
cd docvec-cli
2. Create and activate a virtual environment:
It’s highly recommended to use a virtual environment to manage dependencies.
python -m venv .venv
# On Windows:
.\.venv\Scripts\activate
# On macOS/Linux:
source ./.venv/bin/activate
3. Install dependencies:
Ensure your requirements.txt contains:
pypdf
python-docx
sentence-transformers
langchain-text-splitters
tqdm
numpy
Then run:
pip install -r requirements.txt
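After installing, a quick check like the one below can confirm that the dependencies are importable. It is a minimal sketch using only the standard library; the helper name missing_modules is hypothetical, and the import names (for example docx for python-docx) are assumed from each package's usual conventions.

```python
from importlib.util import find_spec

# Import names assumed for the packages listed in requirements.txt.
REQUIRED_MODULES = [
    "pypdf",
    "docx",  # provided by python-docx
    "sentence_transformers",
    "langchain_text_splitters",
    "tqdm",
    "numpy",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```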
🚀 Usage
Once installed, you can use docvec-cli from your terminal.
Basic Command Structure
python src/main.py --input-path <path_to_document_or_directory> --output-path <path_to_output_directory> [OPTIONS]
Required Arguments
- --input-path <path>: Path to a document file (e.g., report.pdf) or a directory (directory processing is planned for future updates).
- --output-path <path>: Path to the directory where the generated vector and metadata files will be saved.
Optional Arguments
- --chunk-size <int>: Maximum size of each text chunk in characters (default: 1000)
- --chunk-overlap <int>: Number of characters to overlap between chunks (default: 200)
- --model-name <str>: Sentence-transformers model name (default: all-MiniLM-L6-v2)
- --output-format <str>: Format for output files (default: json, the only format currently supported)
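For readers who want to see how the documented flags and defaults fit together, here is a hedged argparse sketch. It mirrors the arguments described above but is not taken from src/main.py; the function name build_parser and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        prog="docvec-cli",
        description="Convert local documents into vector embeddings.",
    )
    parser.add_argument("--input-path", required=True,
                        help="Path to a document file.")
    parser.add_argument("--output-path", required=True,
                        help="Directory for the generated output files.")
    parser.add_argument("--chunk-size", type=int, default=1000,
                        help="Maximum chunk size in characters.")
    parser.add_argument("--chunk-overlap", type=int, default=200,
                        help="Characters shared between adjacent chunks.")
    parser.add_argument("--model-name", default="all-MiniLM-L6-v2",
                        help="Sentence-transformers model name.")
    parser.add_argument("--output-format", default="json", choices=["json"],
                        help="Output file format.")
    return parser


if __name__ == "__main__":
    demo = build_parser().parse_args(
        ["--input-path", "docs/my_report.pdf", "--output-path", "vectors/"]
    )
    print(demo)
```

argparse converts the dashes in flag names to underscores, so --chunk-size is read back as args.chunk_size.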
📁 Examples
Process a single PDF file:
python src/main.py --input-path "docs/my_report.pdf" --output-path "vectors/"
Process a DOCX file with custom chunking:
python src/main.py --input-path "articles/research.docx" --output-path "embeddings/" --chunk-size 500 --chunk-overlap 100
Process a TXT file with a different embedding model:
python src/main.py --input-path "notes/daily_journal.txt" --output-path "processed_data/" --model-name "all-MiniLM-L12-v2"
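To build intuition for what --chunk-size and --chunk-overlap control, the sketch below splits text with a plain sliding window. It is a deliberate simplification and an assumption for illustration only: the tool itself uses langchain's RecursiveCharacterTextSplitter, which also tries to break on separators such as paragraphs and sentences. The function name chunk_text is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Simplified fixed-window chunker (not RecursiveCharacterTextSplitter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(piece)
```

With a chunk size of 4 and an overlap of 2, each chunk repeats the last two characters of the previous one, which is the same idea the defaults of 1000 and 200 apply at a larger scale.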
📄 Output File Structure
For each processed document (e.g., my_report.pdf), a JSON file (my_report_vectors.json) will be created in the specified --output-path.
Example content:
[
  {
    "id": "my_report-0",
    "document": "This is the text content of the first chunk...",
    "embedding": [0.123, -0.456, ..., 0.789],
    "metadata": {
      "source_file": "my_report.pdf",
      "chunk_index": 0,
      "chunk_size": 250
    }
  }
]
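The record layout above maps naturally onto the column-style arguments that vector databases tend to expect. The following sketch, using only the standard library, loads an output file and regroups its records; the helper names load_records and to_columns are hypothetical, and the ChromaDB call in the comment is an assumption about how you might ingest the result, not something this tool does for you.

```python
import json


def load_records(path):
    """Read a DocVec-style JSON output file into a list of dicts."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def to_columns(records):
    """Regroup per-chunk records into parallel lists keyed by field."""
    return {
        "ids": [r["id"] for r in records],
        "documents": [r["document"] for r in records],
        "embeddings": [r["embedding"] for r in records],
        "metadatas": [r["metadata"] for r in records],
    }


if __name__ == "__main__":
    sample = [
        {
            "id": "my_report-0",
            "document": "Example chunk text.",
            "embedding": [0.1, 0.2, 0.3],
            "metadata": {"source_file": "my_report.pdf", "chunk_index": 0},
        }
    ]
    columns = to_columns(sample)
    print(columns["ids"])
    # Assumption: with chromadb installed and a collection created,
    # these columns could be passed as collection.add(**columns).
```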
🤝 Contributing
We welcome contributions from the community! To contribute:
- Fork the repository.
- Create a new branch: git checkout -b feature/your-feature-name
- Make your changes.
- Write clear, concise commit messages.
- Push your branch: git push origin feature/your-feature-name
- Open a Pull Request.
Please ensure:
- Your code follows PEP 8.
- You include appropriate tests.
📄 License
This project is licensed under the MIT License.
📧 Contact
For questions, feedback, or issues, please open an issue.
File details
Details for the file docvec_cli-0.1.0.tar.gz.
File metadata
- Download URL: docvec_cli-0.1.0.tar.gz
- Size: 8.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 16de78735765cb9348da74881bec826ab53b4dc790e3688a84c65ef5c8dcecf5 |
| MD5 | c9fb88a4acc32bdc0c9400403c5adb78 |
| BLAKE2b-256 | 63d8eec1b5318acec9c2ca565189c601a744eb9ad8c1d37ca7c47d393ac4bdb1 |
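If you download a distribution file manually, its SHA256 digest can be compared against the published value. The sketch below shows one way to compute it with Python's standard hashlib module; the function name sha256_of_file and the local file path in the usage comment are assumptions.

```python
import hashlib


def sha256_of_file(path, block_size=65536):
    """Compute the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size blocks until read() returns an empty bytes object.
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Usage (assuming the archive was saved to the current directory):
#   expected = "16de78735765cb9348da74881bec826ab53b4dc790e3688a84c65ef5c8dcecf5"
#   print(sha256_of_file("docvec_cli-0.1.0.tar.gz") == expected)
```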
File details
Details for the file docvec_cli-0.1.0-py3-none-any.whl.
File metadata
- Download URL: docvec_cli-0.1.0-py3-none-any.whl
- Size: 9.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ae7919ba989b593c9b0d4ab7d41e7acd744f5887104efc3b115c5b613f131463 |
| MD5 | 105288adaccfe27c174c49937378f612 |
| BLAKE2b-256 | f8c60a68136f5477a0d8d3c562b971a842c6fd803399bb21334861d2085feed6 |