text-vectorify

PyPI version Python 3.8+ License: MIT PyPI Downloads

A powerful and flexible Python tool for text vectorization with multiple embedding models and intelligent caching.

Simple Description

text-vectorify is a command-line tool and Python library that converts JSONL text data into vector embeddings using OpenAI, SentenceBERT, BGE, M3E, or HuggingFace Transformers models. It caches embeddings to avoid recomputation, combines multiple text fields into a single input, and reads and writes JSONL so it fits into text analysis pipelines.

Quick Start

pip install text-vectorify

# Basic usage with default model
text-vectorify \
  --input data.jsonl \
  --input-field-main "title" \
  --input-field-subtitle "content" \
  --process-method "OpenAIEmbedder" \
  --process-extra-data "your-openai-api-key"

# Using stdin input
cat data.jsonl | text-vectorify \
  --input-field-main "title" \
  --process-method "BGEEmbedder"

Features

  • Multiple Embedding Models: OpenAI, SentenceBERT, BGE, M3E, HuggingFace
  • Intelligent Caching: Avoid recomputing embeddings for duplicate texts
  • Flexible Field Combination: Combine multiple JSON fields for embedding
  • JSONL Processing: Seamless input/output in JSONL format
  • Batch Processing: Efficient processing of large datasets
  • Error Resilience: Continue processing even if individual records fail
  • Stdin Support: Read input from pipes or stdin for flexible data processing
  • Smart Defaults: Default model names for quick start without configuration
  • Flexible Input: Support file input, stdin, or explicit stdin markers
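The error-resilience feature listed above can be pictured as a per-record try/except loop. The sketch below is illustrative only and is not taken from the package's source; `embed_records` and `fake_embed` are hypothetical names introduced for this example.

```python
import json
from typing import Callable, Iterable, Iterator, List


def embed_records(
    lines: Iterable[str],
    embed: Callable[[str], List[float]],
    field: str = "title",
    output_field: str = "vector",
) -> Iterator[dict]:
    """Yield records with an added vector, skipping records that fail."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
            record[output_field] = embed(record[field])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
            # Report the problem and move on instead of aborting the run.
            print(f"skipping line {line_no}: {exc!r}")
            continue
        yield record


def fake_embed(text: str) -> List[float]:
    """Stand-in embedder (hypothetical): returns a two-number 'vector'."""
    return [float(len(text)), float(text.count(" "))]


sample = [
    '{"title": "Sample Article"}',
    "not valid json",
    '{"content": "no title field"}',
    '{"title": "Another Article"}',
]
results = list(embed_records(sample, fake_embed))
print(len(results), "of", len(sample), "records embedded")
```

Only the two well-formed records with a `title` field come out the other end; the malformed line and the record without the field are reported and skipped.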

Installation

Method 1: pip install (Recommended)

# Install core package only
pip install text-vectorify

# Install with specific embedder support
pip install text-vectorify[openai]              # OpenAI support
pip install text-vectorify[sentence-transformers] # SentenceBERT, BGE, M3E support
pip install text-vectorify[huggingface]         # HuggingFace support
pip install text-vectorify[all]                 # All embedding models

# Install with development dependencies
pip install text-vectorify[dev]

Method 2: From source

# Clone repository
git clone https://github.com/changyy/py-text-vectorify.git
cd py-text-vectorify

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install package
make install          # Core package only
# or
make install-dev      # With development dependencies  
# or
make install-all      # With all optional dependencies

Method 3: Development setup

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install in development mode
pip install -e ".[dev]"

# Run tests
make test

# Format code
make format

Install additional packages based on the embedding models you plan to use:

# For OpenAI embeddings
pip install openai

# For SentenceBERT, BGE, M3E models
pip install sentence-transformers

# For HuggingFace models
pip install transformers torch

Usage

Command Line Interface

text-vectorify [OPTIONS]

Required Arguments

  • --input-field-main: Main text fields (comma-separated)
  • --process-method: Embedding method to use

Optional Arguments

  • --input: Path to input JSONL file (use "-" for stdin, or omit to read from stdin)
  • --process-model-name: Model name to use (optional, will use defaults if not specified)
  • --input-field-subtitle: Additional text fields (comma-separated)
  • --process-extra-data: Extra data like API keys
  • --output-field: Output vector field name (default: "vector")
  • --output-cache-dir: Cache directory (default: "./cache")
  • --output: Output file path (default: auto-generated)

Quick Start Features

The tool now supports smart defaults and flexible input methods for easier usage:

Default Models

Each embedder has a default model, so --process-model-name can be omitted:

  • OpenAI: text-embedding-3-small
  • BGE: BAAI/bge-small-en-v1.5
  • SentenceBERT: paraphrase-multilingual-MiniLM-L12-v2
  • M3E: moka-ai/m3e-base
  • HuggingFace: sentence-transformers/all-MiniLM-L6-v2

Flexible Input Methods

  • File input: --input data.jsonl
  • Stdin (auto-detect): cat data.jsonl | text-vectorify ...
  • Explicit stdin: --input -
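The three input methods above reduce to one rule: a real path opens a file, while "-" or no path at all means stdin. A minimal sketch of that rule, assuming a hypothetical helper named `open_input` (the package's own implementation may differ):

```python
import io
import sys
from typing import Optional, TextIO


def open_input(path: Optional[str], stdin: TextIO = sys.stdin) -> TextIO:
    """Return a readable stream for a file path, "-", or no path at all."""
    if path is None or path == "-":
        return stdin  # piped data or explicit stdin marker
    return open(path, "r", encoding="utf-8")


# Demonstrate with an in-memory stream standing in for real stdin.
fake_stdin = io.StringIO('{"title": "Sample"}\n')
print(open_input("-", fake_stdin).read().strip())
print(open_input(None, fake_stdin) is fake_stdin)
```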

Minimal Example

# The simplest possible usage
cat data.jsonl | text-vectorify --input-field-main "title" --process-method "BGEEmbedder"

Input Format

JSONL file with text data:

{"title": "Sample Article", "content": "This is the content...", "author": "John Doe"}
{"title": "Another Article", "content": "More content here...", "author": "Jane Smith"}

Output Format

JSONL file with added vector embeddings:

{"title": "Sample Article", "content": "This is the content...", "author": "John Doe", "vector": [0.1, 0.2, 0.3, ...]}
{"title": "Another Article", "content": "More content here...", "author": "Jane Smith", "vector": [0.4, 0.5, 0.6, ...]}

Supported Models

OpenAI Embeddings

  • Default Model: text-embedding-3-small
  • Other Models: text-embedding-3-large
  • API Key: Required; pass it via --process-extra-data (the Environment Variables section also lists OPENAI_API_KEY)
  • Dimensions: 1536 (small), 3072 (large)

SentenceBERT

  • Default Model: paraphrase-multilingual-MiniLM-L12-v2
  • Language: Multilingual support
  • Dimensions: 384

BGE (BAAI General Embedding, from the Beijing Academy of Artificial Intelligence)

  • Default Model: BAAI/bge-small-en-v1.5
  • Other Models: BAAI/bge-base-zh-v1.5, BAAI/bge-small-zh-v1.5
  • Language: Optimized for Chinese and English
  • Dimensions: 384 (small-en), 512 (small-zh), 768 (base)

M3E (Moka Massive Mixed Embedding)

  • Default Model: moka-ai/m3e-base
  • Other Models: moka-ai/m3e-small
  • Language: Chinese specialized
  • Dimensions: 768 (base), 512 (small)

HuggingFace Transformers

  • Default Model: sentence-transformers/all-MiniLM-L6-v2
  • Flexibility: Custom model selection
  • Dimensions: Model-dependent
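Because each model produces vectors of a fixed size, a quick length check can catch output files produced with a different model than expected. The sketch below uses a subset of the dimensions listed in this section; `check_dimension` is a hypothetical helper, not part of the package.

```python
from typing import Dict, List

# Dimensions copied from the model list above (subset).
EXPECTED_DIMENSIONS: Dict[str, int] = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "paraphrase-multilingual-MiniLM-L12-v2": 384,
    "BAAI/bge-base-zh-v1.5": 768,
    "BAAI/bge-small-zh-v1.5": 512,
    "moka-ai/m3e-base": 768,
    "moka-ai/m3e-small": 512,
}


def check_dimension(model_name: str, vector: List[float]) -> bool:
    """Return True if the vector length matches the listed dimension.

    Models missing from the table return True, since their size is
    model-dependent and cannot be checked here.
    """
    expected = EXPECTED_DIMENSIONS.get(model_name)
    return expected is None or len(vector) == expected


print(check_dimension("moka-ai/m3e-base", [0.0] * 768))
print(check_dimension("moka-ai/m3e-base", [0.0] * 512))
```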

Examples

Example 1: OpenAI Embeddings (with default model)

text-vectorify \
  --input articles.jsonl \
  --input-field-main "title" \
  --input-field-subtitle "content,summary" \
  --process-method "OpenAIEmbedder" \
  --process-extra-data "sk-your-openai-api-key" \
  --output-field "embedding" \
  --output processed_articles.jsonl

Example 2: Using stdin input with default BGE model

cat chinese_news.jsonl | text-vectorify \
  --input-field-main "title,content" \
  --process-method "BGEEmbedder" \
  --output-cache-dir ./models_cache

Example 3: Multilingual with SentenceBERT (default model)

text-vectorify \
  --input multilingual_docs.jsonl \
  --input-field-main "title" \
  --input-field-subtitle "description,tags" \
  --process-method "SentenceBertEmbedder"

Example 4: Custom model specification

text-vectorify \
  --input products.jsonl \
  --input-field-main "name,brand" \
  --input-field-subtitle "description,category,tags" \
  --process-method "BGEEmbedder" \
  --process-model-name "BAAI/bge-base-zh-v1.5" \
  --output-field "product_vector"

Example 5: Explicit stdin marker

echo '{"title": "Sample", "content": "Text content"}' | text-vectorify \
  --input - \
  --input-field-main "title" \
  --process-method "M3EEmbedder"

Example 6: Quick start with minimal arguments

# Most minimal usage - using defaults
cat data.jsonl | text-vectorify \
  --input-field-main "title" \
  --process-method "BGEEmbedder"

API Reference

Python API Usage

from text_vectorify import TextVectorify, EmbedderFactory

# Create embedder
embedder = EmbedderFactory.create_embedder(
    "OpenAIEmbedder",
    "text-embedding-3-small",
    api_key="your-api-key"
)

# Initialize vectorizer
vectorizer = TextVectorify(embedder)

# Process JSONL file
vectorizer.process_jsonl(
    input_path="input.jsonl",
    output_path="output.jsonl",
    input_field_main=["title"],
    input_field_subtitle=["content"],
    output_field="vector"
)

Available Embedders

from text_vectorify import EmbedderFactory

# List all available embedders
embedders = EmbedderFactory.list_embedders()
print(embedders)
# ['OpenAIEmbedder', 'SentenceBertEmbedder', 'BGEEmbedder', 'M3EEmbedder', 'HuggingFaceEmbedder']

Configuration

Cache Management

The tool automatically caches:

  • Text embeddings: Avoid recomputing identical texts
  • Model files: Download models once and reuse
  • Cache location: Configurable via --output-cache-dir

Cache Structure

cache/
├── models/                 # Downloaded models
│   ├── sentence_transformers/
│   ├── huggingface/
│   └── bge/
└── [hash].pkl              # Cached embeddings
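The `[hash].pkl` entries suggest a content-addressed cache: a hash of the input identifies a pickle file holding the vector. The sketch below shows one way such a cache could work. The key scheme (SHA-256 over model name plus text) and the helper names are assumptions for illustration, not the package's documented behavior.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path
from typing import Callable, List


def cache_key(model_name: str, text: str) -> str:
    """Assumed key scheme: SHA-256 of the model name and the text."""
    return hashlib.sha256(f"{model_name}\n{text}".encode("utf-8")).hexdigest()


def cached_embed(
    cache_dir: Path,
    model_name: str,
    text: str,
    embed: Callable[[str], List[float]],
) -> List[float]:
    """Return a cached vector if present, otherwise compute and store it."""
    path = cache_dir / f"{cache_key(model_name, text)}.pkl"
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    vector = embed(text)
    cache_dir.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(vector, fh)
    return vector


calls: List[str] = []


def fake_embed(text: str) -> List[float]:
    """Stand-in embedder (hypothetical) that records each call."""
    calls.append(text)
    return [float(len(text))]


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "cache"
    first = cached_embed(cache_dir, "demo-model", "hello", fake_embed)
    second = cached_embed(cache_dir, "demo-model", "hello", fake_embed)

print(first == second, "embed calls:", len(calls))
```

Including the model name in the key keeps vectors from different models apart, which matters because their dimensions differ.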

Environment Variables

export OPENAI_API_KEY="your-openai-api-key"  # For OpenAI embeddings
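This README shows the API key supplied both through --process-extra-data and through OPENAI_API_KEY. A script wrapping the tool might resolve the key as sketched below. The precedence (explicit value first, environment second) is an assumption, and `resolve_api_key` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    cli_value: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick the API key: explicit value first, then OPENAI_API_KEY."""
    key = cli_value or env.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError(
            "No OpenAI API key: pass --process-extra-data or set OPENAI_API_KEY"
        )
    return key


print(resolve_api_key("sk-from-cli", env={}))
print(resolve_api_key(None, env={"OPENAI_API_KEY": "sk-from-env"}))
```

Keeping the key in the environment avoids it appearing in shell history or process listings, which is a common reason to prefer it over a command-line argument.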

Performance Tips

  1. Use caching: Keep the same --output-cache-dir across runs so identical texts are not re-embedded
  2. Batch processing: Process large files in chunks
  3. Model selection: Choose appropriate model for your language and use case
  4. Field combination: Combine relevant fields for better semantic representation
  5. Stdin processing: Use stdin for pipeline integration and memory efficiency
  6. Default models: Start with default models for quick prototyping, then customize as needed
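For the batch-processing tip, a large JSONL stream can be split into fixed-size chunks without loading it all into memory. This is a generic sketch using `itertools.islice`; the helper name `chunked` and the chunk size are illustrative choices, not options of the tool.

```python
import io
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` lines."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Five JSONL lines in an in-memory stream standing in for a large file.
stream = io.StringIO(
    "".join(f'{{"title": "Article {i}"}}\n' for i in range(5))
)
sizes = [len(chunk) for chunk in chunked(stream, 2)]
print(sizes)
```

Each chunk could be written to its own file and passed to text-vectorify separately, with a shared cache directory so repeated texts are embedded only once.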

Troubleshooting

Common Issues

Import Error: Missing dependencies

pip install sentence-transformers transformers torch openai

API Key Error: Invalid or missing OpenAI API key

export OPENAI_API_KEY="your-valid-api-key"

Memory Error: Large models on limited RAM

  • Use a smaller model such as BAAI/bge-small-zh-v1.5
  • Process files in smaller batches

Cache Permission Error: Insufficient cache directory permissions

chmod 755 ./cache
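For the import error case above, it can help to check which optional dependencies are importable before choosing an embedder. The sketch below maps each embedder to the packages named in the installation section and probes them with `importlib.util.find_spec`. The mapping is inferred from this README and `missing_modules` is a hypothetical helper.

```python
import importlib.util
from typing import Dict, List

# Import names of the packages the installation section associates
# with each embedder (inferred from this README).
REQUIREMENTS: Dict[str, List[str]] = {
    "OpenAIEmbedder": ["openai"],
    "SentenceBertEmbedder": ["sentence_transformers"],
    "BGEEmbedder": ["sentence_transformers"],
    "M3EEmbedder": ["sentence_transformers"],
    "HuggingFaceEmbedder": ["transformers", "torch"],
}


def missing_modules(embedder: str) -> List[str]:
    """List required modules that cannot be found in this environment."""
    return [
        name
        for name in REQUIREMENTS.get(embedder, [])
        if importlib.util.find_spec(name) is None
    ]


for embedder in REQUIREMENTS:
    missing = missing_modules(embedder)
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(f"{embedder}: {status}")
```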

Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

git clone https://github.com/changyy/py-text-vectorify.git
cd py-text-vectorify
pip install -e .
pip install -r requirements-dev.txt

Running Tests

pytest tests/

License

This project is licensed under the MIT License - see the LICENSE file for details.

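Benchmarks

| Model                         | Language | Dimension | Speed  | Quality   |
|-------------------------------|----------|-----------|--------|-----------|
| OpenAI text-embedding-3-small | Multi    | 1536      | Fast   | Excellent |
| BGE-base-zh                   | Chinese  | 768       | Medium | Excellent |
| SentenceBERT                  | Multi    | 384       | Fast   | Good      |
| M3E-base                      | Chinese  | 768       | Medium | Excellent |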

Made with ❤️ for the text analysis community
