A powerful and flexible Python tool for text vectorization with multiple embedding models

Maintainers

changyy

text-vectorify

A powerful and flexible Python tool for text vectorization with multiple embedding models and intelligent caching.

📋 Simple Description

text-vectorify is a command-line tool that converts JSONL text data into vector embeddings using OpenAI, SentenceBERT, BGE, M3E, or HuggingFace transformer models. It caches embeddings so identical texts are not recomputed, can combine several JSON fields into one text per record, and reads and writes JSONL so it slots into text analysis pipelines.

🚀 Quick Start

pip install text-vectorify

# Basic usage with default model
text-vectorify \
  --input data.jsonl \
  --input-field-main "title" \
  --input-field-subtitle "content" \
  --process-method "OpenAIEmbedder" \
  --process-extra-data "your-openai-api-key"

# Using stdin input
cat data.jsonl | text-vectorify \
  --input-field-main "title" \
  --process-method "BGEEmbedder"

✨ Features

🎯 Multiple Embedding Models: OpenAI, SentenceBERT, BGE, M3E, HuggingFace
🚄 Intelligent Caching: Avoid recomputing embeddings for duplicate texts
📊 Flexible Field Combination: Combine multiple JSON fields for embedding
📁 JSONL Processing: Seamless input/output in JSONL format
⚡ Batch Processing: Efficient processing of large datasets
🛡️ Error Resilience: Continue processing even if individual records fail
📥 Stdin Support: Read input from pipes or stdin for flexible data processing
🎛️ Smart Defaults: Default model names for quick start without configuration
🔧 Flexible Input: Support file input, stdin, or explicit stdin markers
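
The field-combination feature listed above can be pictured with a short sketch. The `combine_fields` helper is hypothetical and not part of text-vectorify; it assumes the main and subtitle fields are joined with a single space and that missing fields are skipped, which may differ from what the tool actually does.

```python
from typing import Iterable, Mapping


def combine_fields(
    record: Mapping[str, object],
    main_fields: Iterable[str],
    subtitle_fields: Iterable[str] = (),
    separator: str = " ",
) -> str:
    """Join the selected fields of one record into a single string (assumed behavior)."""
    parts = []
    for name in [*main_fields, *subtitle_fields]:
        value = record.get(name)
        if value is None:
            continue  # skip missing fields instead of failing the whole record
        text = str(value).strip()
        if text:
            parts.append(text)
    return separator.join(parts)


record = {"title": "Sample Article", "content": "This is the content...", "author": "John Doe"}
print(combine_fields(record, ["title"], ["content"]))
```

Skipping missing fields rather than raising mirrors the "Error Resilience" idea: one incomplete record should not stop a long run.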

🔧 Installation

Method 1: pip install (Recommended)

# Install core package only
pip install text-vectorify

# Install with specific embedder support
# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install "text-vectorify[openai]"                # OpenAI support
pip install "text-vectorify[sentence-transformers]" # SentenceBERT, BGE, M3E support
pip install "text-vectorify[huggingface]"           # HuggingFace support
pip install "text-vectorify[all]"                   # All embedding models

# Install with development dependencies
pip install "text-vectorify[dev]"

Method 2: From source

# Clone repository
git clone https://github.com/changyy/py-text-vectorify.git
cd py-text-vectorify

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install package
make install          # Core package only
# or
make install-dev      # With development dependencies  
# or
make install-all      # With all optional dependencies

Method 3: Development setup

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install in development mode
pip install -e ".[dev]"

# Run tests
make test

# Format code
make format

Install additional packages based on the embedding models you plan to use:

# For OpenAI embeddings
pip install openai

# For SentenceBERT, BGE, M3E models
pip install sentence-transformers

# For HuggingFace models
pip install transformers torch

📝 Usage

Command Line Interface

text-vectorify [OPTIONS]

Required Arguments

--input-field-main: Main text fields (comma-separated)
--process-method: Embedding method to use

Optional Arguments

--input: Path to input JSONL file (use "-" for stdin, or omit to read from stdin)
--process-model-name: Model name to use (optional, will use defaults if not specified)
--input-field-subtitle: Additional text fields (comma-separated)
--process-extra-data: Extra data like API keys
--output-field: Output vector field name (default: "vector")
--output-cache-dir: Cache directory (default: "./cache")
--output: Output file path (default: auto-generated)
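
When the CLI is driven from a script, it can help to assemble the argument list programmatically. The sketch below uses only the flags documented above; the `build_command` helper and its parameter names are hypothetical and exist only for illustration.

```python
import shlex
from typing import List, Optional, Sequence


def build_command(
    main_fields: Sequence[str],
    method: str,
    input_path: Optional[str] = None,
    subtitle_fields: Sequence[str] = (),
    model_name: Optional[str] = None,
    extra_data: Optional[str] = None,
    output_field: Optional[str] = None,
    cache_dir: Optional[str] = None,
    output_path: Optional[str] = None,
) -> List[str]:
    """Assemble a text-vectorify argv list from the documented options."""
    argv = [
        "text-vectorify",
        "--input-field-main", ",".join(main_fields),
        "--process-method", method,
    ]
    optional = [
        ("--input", input_path),
        ("--input-field-subtitle", ",".join(subtitle_fields) or None),
        ("--process-model-name", model_name),
        ("--process-extra-data", extra_data),
        ("--output-field", output_field),
        ("--output-cache-dir", cache_dir),
        ("--output", output_path),
    ]
    for flag, value in optional:
        if value is not None:  # omitted options fall back to the tool's defaults
            argv += [flag, value]
    return argv


command = build_command(["title"], "BGEEmbedder", input_path="data.jsonl", subtitle_fields=["content"])
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)`; passing a list rather than a shell string avoids quoting problems with field names or paths.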

Quick Start Features

The tool provides default model names and several input methods, so a run needs only the two required arguments:

Default Models

Each embedder has a default model, so --process-model-name can be omitted:

OpenAI: text-embedding-3-small
BGE: BAAI/bge-small-en-v1.5
SentenceBERT: paraphrase-multilingual-MiniLM-L12-v2
M3E: moka-ai/m3e-base
HuggingFace: sentence-transformers/all-MiniLM-L6-v2
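
The defaults above amount to a lookup table keyed by embedder. The sketch below shows one way such a fallback could be expressed; the embedder class names come from the "Available Embedders" listing in this README, while `DEFAULT_MODELS` and `resolve_model_name` are hypothetical names, not the package's own code.

```python
from typing import Optional

# Default model per embedder, copied from the list above.
DEFAULT_MODELS = {
    "OpenAIEmbedder": "text-embedding-3-small",
    "BGEEmbedder": "BAAI/bge-small-en-v1.5",
    "SentenceBertEmbedder": "paraphrase-multilingual-MiniLM-L12-v2",
    "M3EEmbedder": "moka-ai/m3e-base",
    "HuggingFaceEmbedder": "sentence-transformers/all-MiniLM-L6-v2",
}


def resolve_model_name(method: str, model_name: Optional[str] = None) -> str:
    """Return the explicit model name if given, else the default for the embedder."""
    if model_name:
        return model_name
    try:
        return DEFAULT_MODELS[method]
    except KeyError:
        known = ", ".join(sorted(DEFAULT_MODELS))
        raise ValueError(f"Unknown embedder {method!r}; expected one of: {known}") from None


print(resolve_model_name("BGEEmbedder"))
print(resolve_model_name("BGEEmbedder", "BAAI/bge-base-zh-v1.5"))
```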

Flexible Input Methods

File input: --input data.jsonl
Stdin (auto-detect): cat data.jsonl | text-vectorify ...
Explicit stdin: --input -
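
The three input methods reduce to one rule: a real path opens a file, while "-" or no path at all means stdin. A minimal sketch of that rule follows; `open_input` is a hypothetical helper, and the tool's own detection logic may differ.

```python
import io
import sys
from typing import Optional, TextIO


def open_input(path: Optional[str], stdin: TextIO = sys.stdin) -> TextIO:
    """Return a text stream for a file path; "-" or None both select stdin."""
    if path is None or path == "-":
        return stdin
    return open(path, "r", encoding="utf-8")


# Simulate piped input with an in-memory stream instead of the real stdin.
stream = open_input("-", stdin=io.StringIO('{"title": "Sample"}\n'))
print(stream.read().strip())
```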

Minimal Example

# The simplest possible usage
cat data.jsonl | text-vectorify --input-field-main "title" --process-method "BGEEmbedder"

Input Format

JSONL file with text data:

{"title": "Sample Article", "content": "This is the content...", "author": "John Doe"}
{"title": "Another Article", "content": "More content here...", "author": "Jane Smith"}
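
If the source data lives in Python objects rather than a file, a JSONL input can be produced with the standard library alone. The sketch below is illustrative; `write_jsonl` is a hypothetical helper and the file name is arbitrary.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Mapping


def write_jsonl(path: Path, records: Iterable[Mapping[str, object]]) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            # ensure_ascii=False keeps non-English text readable in the file
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


articles = [
    {"title": "Sample Article", "content": "This is the content...", "author": "John Doe"},
    {"title": "Another Article", "content": "More content here...", "author": "Jane Smith"},
]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data.jsonl"
    written = write_jsonl(target, articles)
    lines = target.read_text(encoding="utf-8").splitlines()

print(written, len(lines))
```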

Output Format

JSONL file with added vector embeddings:

{"title": "Sample Article", "content": "This is the content...", "author": "John Doe", "vector": [0.1, 0.2, 0.3, ...]}
{"title": "Another Article", "content": "More content here...", "author": "Jane Smith", "vector": [0.4, 0.5, 0.6, ...]}
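
Once each record carries a vector, a common next step is comparing records by cosine similarity. The sketch below uses only the standard library and two made-up two-dimensional vectors; real embeddings have hundreds or thousands of dimensions, and the output field name is assumed to be the default "vector".

```python
import json
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two lines shaped like the tool's output, with toy vectors for illustration.
output_lines = [
    '{"title": "A", "vector": [1.0, 0.0]}',
    '{"title": "B", "vector": [0.0, 1.0]}',
]
records = [json.loads(line) for line in output_lines]
print(cosine_similarity(records[0]["vector"], records[1]["vector"]))
```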

🤖 Supported Models

OpenAI Embeddings

Default Model: text-embedding-3-small
Other Models: text-embedding-3-large
API Key: Required via --process-extra-data
Dimensions: 1536 (small), 3072 (large)

SentenceBERT

Default Model: paraphrase-multilingual-MiniLM-L12-v2
Language: Multilingual support
Dimensions: 384

BGE (BAAI General Embedding, from the Beijing Academy of Artificial Intelligence)

Default Model: BAAI/bge-small-en-v1.5
Other Models: BAAI/bge-base-zh-v1.5, BAAI/bge-small-zh-v1.5
Language: Optimized for Chinese and English
Dimensions: 384 (bge-small-en-v1.5), 512 (bge-small-zh-v1.5), 768 (base)

M3E (Moka Massive Mixed Embedding)

Default Model: moka-ai/m3e-base
Other Models: moka-ai/m3e-small
Language: Chinese specialized
Dimensions: 768 (base), 512 (small)

HuggingFace Transformers

Default Model: sentence-transformers/all-MiniLM-L6-v2
Flexibility: Custom model selection
Dimensions: Model-dependent
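
Because each model has a fixed output dimension, a quick sanity check on an output file is to confirm every vector has the expected length. The sketch below is a partial, hypothetical table covering some of the models named above, plus a checker; `EXPECTED_DIMENSIONS` and `find_bad_vectors` are illustrative names and not part of the package.

```python
import json
from typing import Iterable, List

# Partial table of dimensions taken from the model descriptions above.
EXPECTED_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "paraphrase-multilingual-MiniLM-L12-v2": 384,
    "BAAI/bge-base-zh-v1.5": 768,
    "moka-ai/m3e-base": 768,
    "moka-ai/m3e-small": 512,
}


def find_bad_vectors(lines: Iterable[str], model_name: str, field: str = "vector") -> List[int]:
    """Return 1-based line numbers whose vector is missing or has the wrong length."""
    expected = EXPECTED_DIMENSIONS[model_name]
    bad = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        vector = json.loads(line).get(field)
        if not isinstance(vector, list) or len(vector) != expected:
            bad.append(number)
    return bad


sample = [
    json.dumps({"title": "ok", "vector": [0.0] * 768}),
    json.dumps({"title": "short", "vector": [0.0] * 3}),
]
print(find_bad_vectors(sample, "moka-ai/m3e-base"))
```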

📚 Examples

Example 1: OpenAI Embeddings (with default model)

text-vectorify \
  --input articles.jsonl \
  --input-field-main "title" \
  --input-field-subtitle "content,summary" \
  --process-method "OpenAIEmbedder" \
  --process-extra-data "sk-your-openai-api-key" \
  --output-field "embedding" \
  --output processed_articles.jsonl

Example 2: Using stdin input with default BGE model

cat chinese_news.jsonl | text-vectorify \
  --input-field-main "title,content" \
  --process-method "BGEEmbedder" \
  --output-cache-dir ./models_cache

Example 3: Multilingual with SentenceBERT (default model)

text-vectorify \
  --input multilingual_docs.jsonl \
  --input-field-main "title" \
  --input-field-subtitle "description,tags" \
  --process-method "SentenceBertEmbedder"

Example 4: Custom model specification

text-vectorify \
  --input products.jsonl \
  --input-field-main "name,brand" \
  --input-field-subtitle "description,category,tags" \
  --process-method "BGEEmbedder" \
  --process-model-name "BAAI/bge-base-zh-v1.5" \
  --output-field "product_vector"

Example 5: Explicit stdin marker

echo '{"title": "Sample", "content": "Text content"}' | text-vectorify \
  --input - \
  --input-field-main "title" \
  --process-method "M3EEmbedder"

Example 6: Quick start with minimal arguments

# Most minimal usage - using defaults
cat data.jsonl | text-vectorify \
  --input-field-main "title" \
  --process-method "BGEEmbedder"

🔧 API Reference

Python API Usage

from text_vectorify import TextVectorify, EmbedderFactory

# Create embedder
embedder = EmbedderFactory.create_embedder(
    "OpenAIEmbedder",
    "text-embedding-3-small",
    api_key="your-api-key"
)

# Initialize vectorizer
vectorizer = TextVectorify(embedder)

# Process JSONL file
vectorizer.process_jsonl(
    input_path="input.jsonl",
    output_path="output.jsonl",
    input_field_main=["title"],
    input_field_subtitle=["content"],
    output_field="vector"
)

Available Embedders

from text_vectorify import EmbedderFactory

# List all available embedders
embedders = EmbedderFactory.list_embedders()
print(embedders)
# ['OpenAIEmbedder', 'SentenceBertEmbedder', 'BGEEmbedder', 'M3EEmbedder', 'HuggingFaceEmbedder']

⚙️ Configuration

Cache Management

The tool automatically caches:

Text embeddings: Avoid recomputing identical texts
Model files: Download models once and reuse

The cache location is configurable via --output-cache-dir (default: "./cache").

Cache Structure

cache/
├── models/                 # Downloaded models
│   ├── sentence_transformers/
│   ├── huggingface/
│   └── bge/
└── [hash].pkl            # Cached embeddings
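
The `[hash].pkl` entries suggest a content-addressed cache. The sketch below illustrates the general technique under stated assumptions: the key is a SHA-256 of the model name plus the text, and each embedding is stored in its own pickle file. The package's real key derivation and file layout are not documented here and may differ; `cache_key`, `cached_embed`, and `fake_embed` are hypothetical.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path
from typing import Callable, List


def cache_key(model_name: str, text: str) -> str:
    """Derive a stable key from the model name and the text (assumed scheme)."""
    return hashlib.sha256(f"{model_name}\x00{text}".encode("utf-8")).hexdigest()


def cached_embed(
    text: str,
    model_name: str,
    embed: Callable[[str], List[float]],
    cache_dir: Path,
) -> List[float]:
    """Return the cached vector for text, computing and storing it on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(model_name, text)}.pkl"
    if path.exists():
        with path.open("rb") as handle:
            return pickle.load(handle)
    vector = embed(text)
    with path.open("wb") as handle:
        pickle.dump(vector, handle)
    return vector


calls = []


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real model; records each call so cache hits are visible."""
    calls.append(text)
    return [float(len(text))]


with tempfile.TemporaryDirectory() as tmp:
    first = cached_embed("hello", "demo-model", fake_embed, Path(tmp))
    second = cached_embed("hello", "demo-model", fake_embed, Path(tmp))

print(first, second, len(calls))
```

Including the model name in the key keeps vectors from different models apart, since the same text embeds differently under each model. Pickle files should only be loaded from cache directories you trust.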

Environment Variables

export OPENAI_API_KEY="your-openai-api-key"  # For OpenAI embeddings
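
When calling the tool from a script, the key can be resolved once and then passed along. The sketch below assumes an explicitly supplied key should take precedence over the environment variable; that precedence and the `resolve_api_key` helper are illustrative assumptions, not documented behavior of the package.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(explicit: Optional[str] = None, env: Mapping[str, str] = os.environ) -> str:
    """Prefer an explicit key, then OPENAI_API_KEY; fail loudly if neither is set."""
    key = explicit or env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("No OpenAI API key: pass one explicitly or set OPENAI_API_KEY")
    return key


# A plain dict stands in for the real environment in this example.
print(resolve_api_key(env={"OPENAI_API_KEY": "your-openai-api-key"}))
```

Reading the key from the environment keeps it out of shell history and process listings, which is harder to guarantee when it is typed as a command-line argument.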

🔍 Performance Tips

Use caching: Enable caching to avoid recomputing embeddings
Batch processing: Process large files in chunks
Model selection: Choose appropriate model for your language and use case
Field combination: Combine relevant fields for better semantic representation
Stdin processing: Use stdin for pipeline integration and memory efficiency
Default models: Start with default models for quick prototyping, then customize as needed
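
The batch-processing tip can be applied before the tool is even invoked, by splitting a large JSONL stream into fixed-size chunks. The sketch below is a generic chunking helper built on the standard library; `batched` is a hypothetical name, and the chunk size is something to tune for your memory budget.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items without loading everything at once."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


print(list(batched(range(5), 2)))  # → [[0, 1], [2, 3], [4]]
```

Passing an open file object as `items` yields batches of lines, each of which can be written to a smaller JSONL file or piped to the tool on its own.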

🐛 Troubleshooting

Common Issues

Import Error: Missing dependencies

pip install sentence-transformers transformers torch openai

API Key Error: Invalid or missing OpenAI API key

export OPENAI_API_KEY="your-valid-api-key"

Memory Error: Large models on limited RAM

Use a smaller model such as BAAI/bge-small-zh-v1.5
Split the input into smaller files and process them one at a time

Cache Permission Error: Insufficient cache directory permissions

chmod 755 ./cache

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

git clone https://github.com/changyy/py-text-vectorify.git
cd py-text-vectorify
pip install -e .
pip install -r requirements-dev.txt

Running Tests

pytest tests/

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📊 Benchmarks

Model	Language	Dimension	Speed	Quality
OpenAI text-embedding-3-small	Multi	1536	Fast	Excellent
BGE-base-zh	Chinese	768	Medium	Excellent
SentenceBERT	Multi	384	Fast	Good
M3E-base	Chinese	768	Medium	Excellent

🔗 Related Projects

📞 Support

GitHub Issues: Report bugs or request features
Documentation: Full documentation

Made with ❤️ for the text analysis community

Release history

1.2.0

Jun 7, 2025

1.1.3

Jun 3, 2025

1.1.2

Jun 3, 2025

1.1.1

Jun 3, 2025

1.1.0

Jun 3, 2025

This version

1.0.0

Jun 2, 2025

Download files

Source Distribution

text_vectorify-1.0.0.tar.gz (20.6 kB)

Uploaded Jun 2, 2025 Source

Built Distribution

text_vectorify-1.0.0-py3-none-any.whl (18.6 kB)

Uploaded Jun 2, 2025 Python 3

File details

Details for the file text_vectorify-1.0.0.tar.gz.

File metadata

Download URL: text_vectorify-1.0.0.tar.gz
Upload date: Jun 2, 2025
Size: 20.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for text_vectorify-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`4fe64281c4d480b6632157b5775f1bb463e7f2e03d56df18164bb767ab12641a`
MD5	`38233695a4fa7bf2d7161485f459a900`
BLAKE2b-256	`bb8019b73ba8e68ef304099a491a84e98ce69d33b7e935eceb3522e2ca07977d`
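
A downloaded archive can be checked against the published SHA256 digest with the standard library. The sketch below hashes a file in chunks; `sha256_of_file` is a hypothetical helper, and the example hashes a small temporary file rather than the real archive so it runs without a download.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"example payload")
    actual = sha256_of_file(sample)

expected = hashlib.sha256(b"example payload").hexdigest()
print(actual == expected)
```

For the real check, call `sha256_of_file` on the downloaded `text_vectorify-1.0.0.tar.gz` and compare the result with the SHA256 value in the table above.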

Provenance

The following attestation bundles were made for text_vectorify-1.0.0.tar.gz:

Publisher: python-publish.yml on changyy/py-text-vectorify

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: text_vectorify-1.0.0.tar.gz
- Subject digest: 4fe64281c4d480b6632157b5775f1bb463e7f2e03d56df18164bb767ab12641a
- Sigstore transparency entry: 227790659
- Sigstore integration time: Jun 2, 2025
Source repository:
- Permalink: changyy/py-text-vectorify@5d6d67e7744c0dcfc95494e7a46a1a391023ce3c
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/changyy
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@5d6d67e7744c0dcfc95494e7a46a1a391023ce3c
- Trigger Event: release

File details

Details for the file text_vectorify-1.0.0-py3-none-any.whl.

File metadata

Download URL: text_vectorify-1.0.0-py3-none-any.whl
Upload date: Jun 2, 2025
Size: 18.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for text_vectorify-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`18e0d6d703ae1812422387dbadb7d1537647aca6d81296877b53901bd2423584`
MD5	`7828f51b028067a22524076ff6d5fd9d`
BLAKE2b-256	`f91c639a1e21f012e861b1a5b800a8e0917fdb4c950ad4bf41d2931d57514642`

Provenance

The following attestation bundles were made for text_vectorify-1.0.0-py3-none-any.whl:

Publisher: python-publish.yml on changyy/py-text-vectorify

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: text_vectorify-1.0.0-py3-none-any.whl
- Subject digest: 18e0d6d703ae1812422387dbadb7d1537647aca6d81296877b53901bd2423584
- Sigstore transparency entry: 227790671
- Sigstore integration time: Jun 2, 2025
Source repository:
- Permalink: changyy/py-text-vectorify@5d6d67e7744c0dcfc95494e7a46a1a391023ce3c
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/changyy
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@5d6d67e7744c0dcfc95494e7a46a1a391023ce3c
- Trigger Event: release
