Row2Vec

License: MIT | Python 3.10+

Row2Vec is a Python library for generating low-dimensional vector embeddings from tabular datasets. It combines a neural (autoencoder) approach with classical methods (PCA, t-SNE, UMAP) to produce dense representations of your rows and categories, suitable for visualization, feature engineering, and exploring the structure of your data.

Features

🎯 Multiple Embedding Methods

  • Neural (Autoencoder): Deep learning approach for complex, non-linear patterns
  • Target-based: Learn embeddings for categorical columns and their relationships
  • PCA: Fast, linear dimensionality reduction with interpretable components
  • t-SNE: Excellent for 2D/3D visualization and cluster discovery
  • UMAP: Balanced preservation of local and global structure
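
To make the idea of a row embedding concrete, here is a minimal sketch of the simplest method in the list, PCA, written directly with NumPy. It is an illustration only and is not Row2Vec's implementation; the function name `pca_embed` and the random input matrix are assumptions made for this example.

```python
import numpy as np


def pca_embed(X: np.ndarray, embedding_dim: int) -> np.ndarray:
    """Project each row of X onto the top `embedding_dim` principal components."""
    # Standardize every column; constant columns keep a divisor of 1.
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0
    Z = (X - mean) / std
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:embedding_dim].T


# Hypothetical numeric table: 100 rows, 6 columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
emb = pca_embed(X, embedding_dim=2)
print(emb.shape)
```

Each row of the input becomes one short vector. The neural, t-SNE, and UMAP modes serve the same purpose but can capture non-linear structure that a linear projection misses.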

🧠 Intelligent Preprocessing

  • Adaptive Missing Value Imputation: Automatically analyzes patterns and applies optimal strategies
  • Pattern-Aware Analysis: Detects problematic missing patterns with configurable strategies
  • Automated Feature Engineering: Handles scaling, encoding, and preprocessing seamlessly
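
For comparison, the kind of manual pipeline these features are meant to replace might look like the sketch below: fill missing values, standardize numeric columns, and one-hot encode categorical ones. This is a generic pandas example under simple assumptions (median and mode imputation), not a description of what Row2Vec does internally; `preprocess` and the toy DataFrame are hypothetical.

```python
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Impute, scale, and one-hot encode a mixed-type table."""
    num_cols = df.select_dtypes(include="number").columns
    cat_cols = df.columns.difference(num_cols)
    out = df.copy()
    for col in num_cols:
        # Numeric columns: median imputation, then zero mean and unit variance.
        out[col] = out[col].fillna(out[col].median())
        std = out[col].std(ddof=0)
        out[col] = (out[col] - out[col].mean()) / (std if std > 0 else 1.0)
    for col in cat_cols:
        # Categorical columns: fill gaps with the most frequent value.
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return pd.get_dummies(out, columns=list(cat_cols), dtype=float)


# Toy data with one missing value per column.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "country": ["SE", "BR", None, "SE"],
})
clean = preprocess(df)
print(clean)
```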

🚀 Advanced Features

  • Neural Architecture Search (NAS): Automatically discovers optimal network architectures
  • Multi-layer Networks: Support for deep architectures with dropout and regularization
  • Model Serialization: Save and load models with full preprocessing pipelines
  • Command-Line Interface: Complete CLI for batch processing and automation

🔧 Production Ready

  • Comprehensive Testing: 163+ test functions across 17 test files
  • Type Safety: Complete MyPy annotations
  • Modern Build System: Uses pyproject.toml with hatchling backend
  • Documentation: Interactive Jupyter Book with executable examples

Installation

pip install row2vec

Quick Start

import pandas as pd
from row2vec import learn_embedding, generate_synthetic_data

# Generate a synthetic dataset (replace this with your own DataFrame)
df = generate_synthetic_data(num_records=1000)

# Generate neural embeddings for each row
embeddings = learn_embedding(
    df,
    mode="unsupervised",
    embedding_dim=5
)
print(f"Embeddings shape: {embeddings.shape}")
print(embeddings.head())

# Learn categorical embeddings
country_embeddings = learn_embedding(
    df,
    mode="target",
    reference_column="Country",
    embedding_dim=3
)
print(f"Country embeddings: {country_embeddings}")

# Compare with classical methods
pca_embeddings = learn_embedding(df, mode="pca", embedding_dim=5)
tsne_embeddings = learn_embedding(df, mode="tsne", embedding_dim=2)
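
Once you have row embeddings, a common next step is to look for similar rows. The sketch below ranks rows by cosine similarity using NumPy only. It works on any 2D array, so it uses a small hand-written matrix instead of real Row2Vec output; `most_similar` is a hypothetical helper, and the conversion of the Quick Start result with `embeddings.to_numpy()` assumes it is a pandas DataFrame, as the `.head()` call above suggests.

```python
import numpy as np


def most_similar(embeddings: np.ndarray, row: int, top_k: int = 3) -> np.ndarray:
    """Return the indices of the `top_k` rows most similar to `row`."""
    # Normalize rows to unit length so the dot product is cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.where(norms == 0, 1.0, norms)
    sims = unit @ unit[row]
    sims[row] = -np.inf  # exclude the query row itself
    return np.argsort(-sims)[:top_k]


# Stand-in embeddings: four rows in two dimensions.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
neighbors = most_similar(emb, row=0, top_k=2)
print(neighbors)
```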

Command Line Interface

# Quick embeddings
row2vec annotate --input data.csv --output embeddings.csv --mode unsupervised --dim 5

# Train and save model
row2vec train --input data.csv --output model.py --mode unsupervised --dim 10 --epochs 50

# Use saved model
row2vec predict --input new_data.csv --model model.py --output predictions.csv

# Target-based embeddings
row2vec annotate --input data.csv --output categories.csv --mode target --target-col Category --dim 3

Advanced Usage

Neural Architecture Search

from row2vec import ArchitectureSearchConfig, search_architecture, EmbeddingConfig, NeuralConfig

# Configure architecture search
config = ArchitectureSearchConfig(
    method='random',
    max_layers=3,
    width_options=[64, 128, 256],
    max_trials=20
)

base_config = EmbeddingConfig(
    mode="unsupervised",
    embedding_dim=8,
    neural=NeuralConfig(max_epochs=50)
)

# Find optimal architecture
best_arch, results = search_architecture(df, base_config, config)
print(f"Best architecture: {best_arch}")

# Train with optimal settings
optimal_embeddings = learn_embedding(
    df,
    mode="unsupervised",
    embedding_dim=8,
    hidden_units=best_arch.get('hidden_units', [128]),
    max_epochs=100
)
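
As background on what a random architecture search does in general, the sketch below samples layer counts and widths at random and keeps the best-scoring candidate. It is a conceptual illustration using only the standard library and says nothing about how `search_architecture` is implemented; `sample_architecture`, `random_search`, and the toy scoring function are all hypothetical. In a real search the score would be a validation loss from training a network.

```python
import random


def sample_architecture(rng, max_layers, width_options):
    """Draw a random list of hidden-layer widths."""
    n_layers = rng.randint(1, max_layers)  # inclusive on both ends
    return [rng.choice(width_options) for _ in range(n_layers)]


def random_search(score_fn, max_layers, width_options, max_trials, seed=0):
    """Evaluate `max_trials` random architectures; lower scores are better."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("inf")
    for _ in range(max_trials):
        arch = sample_architecture(rng, max_layers, width_options)
        score = score_fn(arch)
        if score < best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


def toy_score(arch):
    # Stand-in for a validation loss: prefers about 200 units in few layers.
    return abs(sum(arch) - 200) + 5 * len(arch)


best, score = random_search(
    toy_score, max_layers=3, width_options=[64, 128, 256], max_trials=20
)
print(best, score)
```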

Missing Value Imputation

from row2vec import ImputationConfig, AdaptiveImputer, MissingPatternAnalyzer

# Analyze missing patterns
analyzer = MissingPatternAnalyzer(ImputationConfig())
analysis = analyzer.analyze(df)
print(f"Missing patterns: {analysis['recommendations']}")

# Apply adaptive imputation
imputer = AdaptiveImputer(ImputationConfig(
    numeric_strategy='knn',
    categorical_strategy='mode',
    knn_neighbors=10
))
df_clean = imputer.fit_transform(df)
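
To show what a k-nearest-neighbours strategy for numeric columns means in practice, here is a small NumPy sketch: each missing cell is filled with the mean of that column over the closest rows, where distance is measured on the columns both rows have observed. It is a simplified illustration, not the code behind `AdaptiveImputer`; `knn_impute` and the toy matrix are assumptions for this example.

```python
import numpy as np


def knn_impute(X, n_neighbors=2):
    """Fill NaNs with the mean of the nearest rows' values in that column."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for i in np.where(missing.any(axis=1))[0]:
        observed = ~missing[i]
        dists = []
        for j in range(len(X)):
            if j == i:
                continue
            shared = observed & ~missing[j]
            if not shared.any():
                continue  # no common columns to compare on
            d = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
            dists.append((d, j))
        dists.sort()
        for col in np.where(missing[i])[0]:
            # Use the closest rows that actually have a value in this column.
            donors = [X[j, col] for _, j in dists if not missing[j, col]]
            if donors:
                filled[i, col] = np.mean(donors[:n_neighbors])
    return filled


X = np.array([
    [1.0, 10.0],
    [1.1, 11.0],
    [5.0, 50.0],
    [1.05, np.nan],  # closest to the first two rows
])
filled = knn_impute(X, n_neighbors=2)
print(filled[3])
```

The last row borrows its second value from the two rows nearest to it, so the distant third row has no influence on the result.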

Documentation

Online Documentation

Local Documentation

  • User Guide: Comprehensive guide with mathematical background, detailed examples, and best practices
  • LLM Documentation: Practical guide for LLM coding agents integrating Row2Vec
  • API Reference: Complete function and class reference
  • Tutorials: Executable Python tutorials (Nhandu format) - run make docs to build HTML

Why Row2Vec?

| Method | Row2Vec Advantage | Alternative |
| --- | --- | --- |
| Manual Neural Networks | Automated preprocessing, simple API | 200+ lines of boilerplate |
| sklearn PCA | Integrated preprocessing, multiple methods | Limited to linear reduction |
| sklearn t-SNE/UMAP | Unified interface, consistent preprocessing | Manual pipeline setup |
| Custom Embeddings | Production-ready with serialization | Significant development time |

Contributing

We welcome contributions! Please see our Contributing Guide for details.

Citation

If you use Row2Vec in your research, please cite:

@software{tresoldi_row2vec,
  author = {Tresoldi, Tiago},
  title = {Row2Vec: Neural and Classical Embeddings for Tabular Data},
  url = {https://github.com/evotext/row2vec},
  version = {1.0.0}
}

Acknowledgments

This library was originally developed as part of the "Cultural Evolution of Texts" project, led by Michael Dunn at the Department of Linguistics and Philology, Uppsala University. The project investigates the application of evolutionary models to textual data and cultural transmission patterns.

Authors

Tiago Tresoldi
Affiliate Researcher, Department of Linguistics and Philology
Uppsala University
GitHub: @tresoldi

License

This project is licensed under the MIT License - see the LICENSE file for details.
