A Python library for machine learning and corpus linguistics

These details have not been verified by PyPI

Project description

Vivre

A Python library for parsing EPUB files and aligning parallel texts.

Description

Vivre provides tools for processing parallel texts through a complete pipeline: parsing EPUB files, segmenting text into sentences, and aligning sentences between languages using the Gale-Church algorithm. The library offers both a simple API for programmatic use and a powerful command-line interface.

Features

EPUB Parsing: Robust parsing with content filtering and chapter extraction
Sentence Segmentation: Multi-language sentence segmentation using spaCy
Text Alignment: Statistical text alignment using the Gale-Church algorithm
Multiple Output Formats: JSON, CSV, XML, text, and dictionary formats
Language Support: English, Spanish, French, German, Italian, Portuguese, and more
Simple API: Easy-to-use top-level functions for common tasks
Command Line Interface: Clean CLI with two powerful commands
Error Handling: Comprehensive error handling with helpful messages
Type Safety: Full type hints and validation

Getting Started

Prerequisites

Python 3.11 or higher
pip (Python package installer)

Installation

Option 1: Local Installation

Clone the repository:

git clone https://github.com/anidixit64/vivre.git
cd vivre

Install the package:

pip install -e .

Install required spaCy models:

python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
python -m spacy download fr_core_news_sm
python -m spacy download it_core_news_sm

Option 2: Docker (Recommended)

Clone the repository:

git clone https://github.com/anidixit64/vivre.git
cd vivre

Build the Docker image:

docker build -t vivre .

Use the helper script for different operations:

# Run test suite (default)
./docker-run.sh

# Drop into interactive shell
./docker-run.sh shell

# Show CLI help
./docker-run.sh cli

# Get help on available options
./docker-run.sh help

The Docker setup includes all dependencies and spaCy models pre-installed.

Usage

Command Line Interface

Vivre provides a clean CLI with two powerful commands:

# Parse and analyze an EPUB file
vivre parse book.epub --verbose

# Parse with content display and segmentation
vivre parse book.epub --show-content --segment --language en

# Parse with custom output format
vivre parse book.epub --format csv --output analysis.csv

# Align two EPUB files (language pair is required)
vivre align english.epub french.epub en-fr

# Align with different output formats
vivre align english.epub french.epub en-fr --format json
vivre align english.epub french.epub en-fr --format csv --output alignments.csv
vivre align english.epub french.epub en-fr --format xml --output alignments.xml

# Align with custom parameters
vivre align english.epub french.epub en-fr --c 1.1 --s2 7.0 --gap-penalty 2.5

# Get help
vivre --help
vivre align --help
vivre parse --help

Quick Start Examples:

# Parse a book and see its structure
vivre parse sample.epub --verbose

# Align English and French versions of the same book
vivre align english_book.epub french_book.epub en-fr --format json --output alignment.json

# Parse with sentence segmentation
vivre parse sample.epub --segment --language en --format csv --output sentences.csv

Simple API

Vivre provides easy-to-use top-level functions for common tasks:

import vivre

# Parse EPUB and extract chapters
chapters = vivre.read('path/to/epub')
print(f"Found {len(chapters)} chapters")

# Segment chapters into sentences
segmented = chapters.segment('en')  # Specify language for better accuracy
sentences = segmented.get_segmented()

# Quick alignment - returns simple sentence pairs
pairs = vivre.quick_align('english.epub', 'french.epub', 'en-fr')
for source, target in pairs[:5]:
    print(f"EN: {source}")
    print(f"FR: {target}")

# Full alignment with rich output
result = vivre.align('english.epub', 'french.epub', 'en-fr')
print(result.to_json())      # JSON output
print(result.to_csv())       # CSV output
print(result.to_text())      # Formatted text
print(result.to_xml())       # XML output
print(result.to_dict())      # Python dictionary

# Work with Chapters objects seamlessly
source_chapters = vivre.read('english.epub')
target_chapters = vivre.read('french.epub')
result = vivre.align(source_chapters, target_chapters, 'en-fr')  # Works with objects too!

# Get supported languages
languages = vivre.get_supported_languages()
print(f"Supported languages: {languages}")

Quick Start Examples:

import vivre

# Parse a book
chapters = vivre.read('sample.epub')
print(f"Book has {len(chapters)} chapters")

# Align two books
result = vivre.align('english.epub', 'french.epub', 'en-fr')
print(result.to_json())

# Get sentence pairs
pairs = vivre.quick_align('english.epub', 'french.epub', 'en-fr')
for en, fr in pairs[:3]:
    print(f"EN: {en}")
    print(f"FR: {fr}")
    print()

Advanced Usage

For more control, you can use the individual components:

from vivre import VivreParser, Segmenter, Aligner

# Parse EPUB
parser = VivreParser()
chapters = parser.parse_epub('book.epub')

# Segment text
segmenter = Segmenter()
sentences = segmenter.segment('Hello world!', 'en')

# Align texts
aligner = Aligner()
alignments = aligner.align(['Hello'], ['Bonjour'])

# Pipeline for complex workflows
from vivre import VivrePipeline
pipeline = VivrePipeline('en-fr')
result = pipeline.process_parallel_epubs('english.epub', 'french.epub')

API Reference

Top-level Functions

read(epub_path) - Parse EPUB and return Chapters object
align(source, target, language_pair) - Align parallel texts, returns AlignmentResult
quick_align(source_epub, target_epub, language_pair) - Simple alignment, returns sentence pairs
get_supported_languages() - Get list of supported language codes

Classes

Chapters - Container for parsed EPUB chapters with segmentation support
AlignmentResult - Container for alignment results with multiple output formats
VivreParser - Low-level EPUB parser
Segmenter - Sentence segmentation using spaCy
Aligner - Text alignment using Gale-Church algorithm
VivrePipeline - High-level pipeline for complete workflows

Output Formats

The library supports multiple output formats:

JSON: Structured data for programmatic use
CSV: Tabular data for spreadsheet applications
XML: Hierarchical data for document processing
Text: Human-readable formatted output
Dict: Python dictionary for direct manipulation

Language Support

Vivre supports the following languages through spaCy models:

English (en_core_web_sm)
Spanish (es_core_news_sm)
French (fr_core_news_sm)
Italian (it_core_news_sm)

These are the languages for which spaCy models are pre-installed and ready to use for EPUB parsing and text segmentation.

Development

Running Tests

# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=vivre --cov-report=html

# Run specific test files
pytest tests/test_api.py
pytest tests/test_parser.py

Docker Development

For consistent development environments, use Docker:

# Build the development image
docker build -t vivre .

# Run tests in Docker
docker run --rm vivre python -m pytest tests/ -v

# Interactive development shell
docker run --rm -it vivre /bin/bash

# Run specific test with coverage
docker run --rm vivre python -m pytest tests/test_api.py --cov=src/vivre/api --cov-report=term-missing

Code Quality

The project uses pre-commit hooks for code quality:

# Install pre-commit hooks
pre-commit install

# Run hooks manually
pre-commit run --all-files

Contributing

We welcome contributions! Please see our Contributing Guide for detailed information on how to contribute to this project.

Quick Start for Contributors

Fork the repository on GitHub

Clone your fork locally:

git clone https://github.com/your-username/vivre.git
cd vivre

Create a feature branch:

git checkout -b feature/your-feature-name

Set up development environment:

# Install dependencies
poetry install

# Install pre-commit hooks
pre-commit install

# Install spaCy models
poetry run python -m spacy download en_core_web_sm
poetry run python -m spacy download es_core_news_sm
poetry run python -m spacy download fr_core_news_sm
poetry run python -m spacy download it_core_news_sm

Make your changes and add tests for new functionality

Run tests and quality checks:

# Run all tests
poetry run pytest tests/

# Run with coverage
poetry run pytest tests/ --cov=vivre --cov-report=html

# Run linting and formatting
poetry run ruff check .
poetry run ruff format --check .

# Run type checking
poetry run mypy src/ tests/

Ensure all tests pass and coverage remains >90%
Commit your changes with clear commit messages
Push to your fork and submit a pull request

Development Guidelines

Follow the existing code style and conventions
Add type hints to all new functions
Include docstrings for all public functions and classes
Write tests for new functionality
Update documentation as needed
Ensure all pre-commit hooks pass

For more detailed information, please see our Contributing Guide.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

License Summary

License: Apache License 2.0
SPDX Identifier: Apache-2.0
Permissions: Commercial use, modification, distribution, patent use, private use
Limitations: Liability, warranty
Conditions: License and copyright notice

The Apache License 2.0 is a permissive license that allows for:

Commercial use
Modification
Distribution
Patent use
Private use

While providing liability protection and requiring license and copyright notice preservation.

For the complete license text, please see the LICENSE file in this repository.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.1.0

Jul 31, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vivre-0.1.0.tar.gz (37.2 kB view details)

Uploaded Jul 31, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

vivre-0.1.0-py3-none-any.whl (38.2 kB view details)

Uploaded Jul 31, 2025 Python 3

File details

Details for the file vivre-0.1.0.tar.gz.

File metadata

Download URL: vivre-0.1.0.tar.gz
Upload date: Jul 31, 2025
Size: 37.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.2 CPython/3.13.3 Darwin/24.4.0

File hashes

Hashes for vivre-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`de7feeff1aaca0a8bcf074952bf62944078ec1ee583f24f2cd23d48bf013e79a`
MD5	`94820c1a3e78749a957fb5268a1bf6e6`
BLAKE2b-256	`d4833b52e79963f6d99b3e3d447a189fced88cb4f97d13a600c059fcf15e5d7d`

See more details on using hashes here.

File details

Details for the file vivre-0.1.0-py3-none-any.whl.

File metadata

Download URL: vivre-0.1.0-py3-none-any.whl
Upload date: Jul 31, 2025
Size: 38.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.2 CPython/3.13.3 Darwin/24.4.0

File hashes

Hashes for vivre-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1f63a69119d2fe43faee0b1a013d1b355d2b82cb578f89f40097b9668913b2fd`
MD5	`82350c4ecdf77dde931c1ddcb839d0ad`
BLAKE2b-256	`41cfd03bf64b6f77c689b031d050df9d50e6496b0d2ae10aaaa0730991be2fd2`

See more details on using hashes here.

vivre 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

Vivre

Description

Features

Getting Started

Prerequisites

Installation

Option 1: Local Installation

Option 2: Docker (Recommended)

Usage

Command Line Interface

Simple API

Advanced Usage

API Reference

Top-level Functions

Classes

Output Formats

Language Support

Development

Running Tests

Docker Development

Code Quality

Contributing

Quick Start for Contributors

Development Guidelines

License

License Summary

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes