A Python library for machine learning and corpus linguistics
Vivre
A Python library for parsing EPUB files and aligning parallel texts.
Description
Vivre provides tools for processing parallel texts through a complete pipeline: parsing EPUB files, segmenting text into sentences, and aligning sentences between languages using the Gale-Church algorithm. The library offers both a simple API for programmatic use and a powerful command-line interface.
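To make the alignment step concrete, here is a minimal, self-contained sketch of length-based alignment in the style of Gale and Church. It is not Vivre's implementation. The names `length_cost`, `align_by_length`, and `BEAD_PRIORS` are hypothetical, and the defaults `c=1.0` and `s2=6.8` and the bead priors are the values from the original Gale-Church paper, assumed here for illustration only.

```python
import math

# Prior probability of each bead type (source sentences, target sentences).
# Values from Gale and Church (1993); an assumption, not Vivre's settings.
BEAD_PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
               (2, 1): 0.089, (1, 2): 0.089}


def length_cost(len_src, len_tgt, c=1.0, s2=6.8):
    """Negative log of the two-tailed probability of the length difference."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / c) / 2.0
    delta = (len_tgt - len_src * c) / math.sqrt(mean * s2)
    # erfc(|d| / sqrt(2)) equals 2 * (1 - Phi(|d|)) for a standard normal.
    p = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(p, 1e-300))


def align_by_length(src, tgt, c=1.0, s2=6.8):
    """Return a list of (source_sentences, target_sentences) beads."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward dynamic programming over prefixes of both texts.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in BEAD_PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                new_cost = (cost[i][j]
                            + length_cost(len_src, len_tgt, c, s2)
                            - math.log(prior))
                if new_cost < cost[ni][nj]:
                    cost[ni][nj] = new_cost
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the best path.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    beads.reverse()
    return beads
```

The sketch only looks at character lengths, which is why the method needs no dictionary: sentences that are translations of each other tend to have proportional lengths, and the dynamic program picks the bead sequence with the lowest total cost.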
Features
- EPUB Parsing: Robust parsing with content filtering and chapter extraction
- Sentence Segmentation: Multi-language sentence segmentation using spaCy
- Text Alignment: Statistical text alignment using the Gale-Church algorithm
- Multiple Output Formats: JSON, CSV, XML, text, and dictionary formats
- Language Support: English, Spanish, French, German, Italian, Portuguese, and more
- Simple API: Easy-to-use top-level functions for common tasks
- Command Line Interface: A CLI with two commands, parse and align
- Error Handling: Comprehensive error handling with helpful messages
- Type Safety: Full type hints and validation
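For readers unfamiliar with the EPUB format behind the parsing feature, the sketch below shows the general idea using only the standard library. It is not Vivre's parser. An EPUB is a ZIP archive whose META-INF/container.xml points to a package (OPF) file, and the package's spine lists content documents in reading order. The function name `spine_documents` is hypothetical.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_documents(epub_file):
    """Return the archive paths of content documents in reading order.

    `epub_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(epub_file) as zf:
        # container.xml names the package document (the OPF file).
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.attrib["full-path"]

        package = ET.fromstring(zf.read(opf_path))
        # The manifest maps item ids to hrefs relative to the OPF file.
        manifest = {
            item.attrib["id"]: item.attrib["href"]
            for item in package.findall("opf:manifest/opf:item", OPF_NS)
        }
        base = posixpath.dirname(opf_path)
        # The spine lists item ids in reading order.
        return [
            posixpath.join(base, manifest[ref.attrib["idref"]])
            for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
        ]
```

A real parser also has to filter front matter, handle malformed archives, and extract text from the XHTML documents, which this sketch leaves out.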
Getting Started
Prerequisites
- Python 3.11 or higher
- pip (Python package installer)
Installation
Option 1: Local Installation
- Clone the repository:
git clone https://github.com/anidixit64/vivre.git
cd vivre
- Install the package:
pip install -e .
- Install required spaCy models:
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
python -m spacy download fr_core_news_sm
python -m spacy download it_core_news_sm
Option 2: Docker (Recommended)
- Clone the repository:
git clone https://github.com/anidixit64/vivre.git
cd vivre
- Build the Docker image:
docker build -t vivre .
- Use the helper script for different operations:
# Run test suite (default)
./docker-run.sh
# Drop into interactive shell
./docker-run.sh shell
# Show CLI help
./docker-run.sh cli
# Get help on available options
./docker-run.sh help
The Docker setup includes all dependencies and spaCy models pre-installed.
Usage
Command Line Interface
Vivre provides a CLI with two commands, parse and align:
# Parse and analyze an EPUB file
vivre parse book.epub --verbose
# Parse with content display and segmentation
vivre parse book.epub --show-content --segment --language en
# Parse with custom output format
vivre parse book.epub --format csv --output analysis.csv
# Align two EPUB files (language pair is required)
vivre align english.epub french.epub en-fr
# Align with different output formats
vivre align english.epub french.epub en-fr --format json
vivre align english.epub french.epub en-fr --format csv --output alignments.csv
vivre align english.epub french.epub en-fr --format xml --output alignments.xml
# Align with custom parameters
vivre align english.epub french.epub en-fr --c 1.1 --s2 7.0 --gap-penalty 2.5
# Get help
vivre --help
vivre align --help
vivre parse --help
Quick Start Examples:
# Parse a book and see its structure
vivre parse sample.epub --verbose
# Align English and French versions of the same book
vivre align english_book.epub french_book.epub en-fr --format json --output alignment.json
# Parse with sentence segmentation
vivre parse sample.epub --segment --language en --format csv --output sentences.csv
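When the CLI is driven from a script, it can help to assemble the argument list in one place. The sketch below uses only the `align` arguments and the `--format` and `--output` flags shown above. The helper name `build_align_command` is hypothetical and is not part of Vivre.

```python
import shlex


def build_align_command(source_epub, target_epub, language_pair,
                        output_format=None, output_path=None):
    """Build the argument list for a `vivre align` invocation."""
    cmd = ["vivre", "align", source_epub, target_epub, language_pair]
    if output_format:
        cmd += ["--format", output_format]
    if output_path:
        cmd += ["--output", output_path]
    return cmd


cmd = build_align_command("english.epub", "french.epub", "en-fr",
                          output_format="csv", output_path="alignments.csv")
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the `vivre` command is installed. Passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.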
Simple API
Vivre provides easy-to-use top-level functions for common tasks:
import vivre
# Parse EPUB and extract chapters
chapters = vivre.read('path/to/epub')
print(f"Found {len(chapters)} chapters")
# Segment chapters into sentences
segmented = chapters.segment('en') # Specify language for better accuracy
sentences = segmented.get_segmented()
# Quick alignment - returns simple sentence pairs
pairs = vivre.quick_align('english.epub', 'french.epub', 'en-fr')
for source, target in pairs[:5]:
    print(f"EN: {source}")
    print(f"FR: {target}")
# Full alignment with rich output
result = vivre.align('english.epub', 'french.epub', 'en-fr')
print(result.to_json()) # JSON output
print(result.to_csv()) # CSV output
print(result.to_text()) # Formatted text
print(result.to_xml()) # XML output
print(result.to_dict()) # Python dictionary
# Work with Chapters objects seamlessly
source_chapters = vivre.read('english.epub')
target_chapters = vivre.read('french.epub')
result = vivre.align(source_chapters, target_chapters, 'en-fr') # Works with objects too!
# Get supported languages
languages = vivre.get_supported_languages()
print(f"Supported languages: {languages}")
Quick Start Examples:
import vivre
# Parse a book
chapters = vivre.read('sample.epub')
print(f"Book has {len(chapters)} chapters")
# Align two books
result = vivre.align('english.epub', 'french.epub', 'en-fr')
print(result.to_json())
# Get sentence pairs
pairs = vivre.quick_align('english.epub', 'french.epub', 'en-fr')
for en, fr in pairs[:3]:
    print(f"EN: {en}")
    print(f"FR: {fr}")
    print()
Advanced Usage
For more control, you can use the individual components:
from vivre import VivreParser, Segmenter, Aligner
# Parse EPUB
parser = VivreParser()
chapters = parser.parse_epub('book.epub')
# Segment text
segmenter = Segmenter()
sentences = segmenter.segment('Hello world!', 'en')
# Align texts
aligner = Aligner()
alignments = aligner.align(['Hello'], ['Bonjour'])
# Pipeline for complex workflows
from vivre import VivrePipeline
pipeline = VivrePipeline('en-fr')
result = pipeline.process_parallel_epubs('english.epub', 'french.epub')
API Reference
Top-level Functions
- read(epub_path) - Parse EPUB and return Chapters object
- align(source, target, language_pair) - Align parallel texts, returns AlignmentResult
- quick_align(source_epub, target_epub, language_pair) - Simple alignment, returns sentence pairs
- get_supported_languages() - Get list of supported language codes
Classes
- Chapters - Container for parsed EPUB chapters with segmentation support
- AlignmentResult - Container for alignment results with multiple output formats
- VivreParser - Low-level EPUB parser
- Segmenter - Sentence segmentation using spaCy
- Aligner - Text alignment using Gale-Church algorithm
- VivrePipeline - High-level pipeline for complete workflows
Output Formats
The library supports multiple output formats:
- JSON: Structured data for programmatic use
- CSV: Tabular data for spreadsheet applications
- XML: Hierarchical data for document processing
- Text: Human-readable formatted output
- Dict: Python dictionary for direct manipulation
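As an illustration of what the JSON and CSV formats listed above might look like for aligned sentence pairs, here is a standard-library sketch. The field names `source` and `target` and the helper names are assumptions made for this example. The layout that Vivre's own `to_json()` and `to_csv()` produce may differ.

```python
import csv
import io
import json


def pairs_to_json(pairs):
    """Serialize (source, target) sentence pairs as a JSON array of objects."""
    records = [{"source": src, "target": tgt} for src, tgt in pairs]
    # ensure_ascii=False keeps accented characters readable in the output.
    return json.dumps(records, ensure_ascii=False, indent=2)


def pairs_to_csv(pairs):
    """Serialize (source, target) sentence pairs as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["source", "target"])
    writer.writerows(pairs)
    return buffer.getvalue()


pairs = [("Good morning.", "Bonjour."), ("Thank you, really.", "Merci, vraiment.")]
print(pairs_to_json(pairs))
print(pairs_to_csv(pairs))
```

Using the `csv` module rather than joining strings by hand matters here because sentences routinely contain commas and quotation marks, which the writer escapes correctly.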
Language Support
Vivre supports the following languages through spaCy models:
- English (en_core_web_sm)
- Spanish (es_core_news_sm)
- French (fr_core_news_sm)
- Italian (it_core_news_sm)
These are the languages for which spaCy models are pre-installed and ready to use for EPUB parsing and text segmentation.
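The relationship between a language pair such as en-fr and the spaCy models listed above can be sketched as a simple lookup. This is an illustration only. The names `SPACY_MODELS`, `split_language_pair`, and `models_for_pair` are hypothetical, and Vivre may resolve models differently.

```python
# Language code to spaCy model name, as listed in this section.
SPACY_MODELS = {
    "en": "en_core_web_sm",
    "es": "es_core_news_sm",
    "fr": "fr_core_news_sm",
    "it": "it_core_news_sm",
}


def split_language_pair(pair):
    """Split a pair such as 'en-fr' into ('en', 'fr')."""
    source, sep, target = pair.partition("-")
    if not sep or not source or not target:
        raise ValueError(f"Expected a pair like 'en-fr', got {pair!r}")
    return source, target


def models_for_pair(pair):
    """Return the (source_model, target_model) names for a language pair."""
    source, target = split_language_pair(pair)
    missing = [code for code in (source, target) if code not in SPACY_MODELS]
    if missing:
        raise ValueError(f"No spaCy model listed for: {', '.join(missing)}")
    return SPACY_MODELS[source], SPACY_MODELS[target]


print(models_for_pair("en-fr"))
```

Each returned name is what would be passed to `python -m spacy download` during installation.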
Development
Running Tests
# Run all tests
pytest tests/
# Run with coverage
pytest tests/ --cov=vivre --cov-report=html
# Run specific test files
pytest tests/test_api.py
pytest tests/test_parser.py
Docker Development
For consistent development environments, use Docker:
# Build the development image
docker build -t vivre .
# Run tests in Docker
docker run --rm vivre python -m pytest tests/ -v
# Interactive development shell
docker run --rm -it vivre /bin/bash
# Run specific test with coverage
docker run --rm vivre python -m pytest tests/test_api.py --cov=src/vivre/api --cov-report=term-missing
Code Quality
The project uses pre-commit hooks for code quality:
# Install pre-commit hooks
pre-commit install
# Run hooks manually
pre-commit run --all-files
Contributing
We welcome contributions! Please see our Contributing Guide for detailed information on how to contribute to this project.
Quick Start for Contributors
- Fork the repository on GitHub
- Clone your fork locally:
git clone https://github.com/your-username/vivre.git
cd vivre
- Create a feature branch:
git checkout -b feature/your-feature-name
- Set up development environment:
# Install dependencies
poetry install
# Install pre-commit hooks
pre-commit install
# Install spaCy models
poetry run python -m spacy download en_core_web_sm
poetry run python -m spacy download es_core_news_sm
poetry run python -m spacy download fr_core_news_sm
poetry run python -m spacy download it_core_news_sm
- Make your changes and add tests for new functionality
- Run tests and quality checks:
# Run all tests
poetry run pytest tests/
# Run with coverage
poetry run pytest tests/ --cov=vivre --cov-report=html
# Run linting and formatting
poetry run ruff check .
poetry run ruff format --check .
# Run type checking
poetry run mypy src/ tests/
- Ensure all tests pass and coverage remains >90%
- Commit your changes with clear commit messages
- Push to your fork and submit a pull request
Development Guidelines
- Follow the existing code style and conventions
- Add type hints to all new functions
- Include docstrings for all public functions and classes
- Write tests for new functionality
- Update documentation as needed
- Ensure all pre-commit hooks pass
For more detailed information, please see our Contributing Guide.
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
License Summary
- License: Apache License 2.0
- SPDX Identifier: Apache-2.0
- Permissions: Commercial use, modification, distribution, patent use, private use
- Limitations: Liability, warranty
- Conditions: License and copyright notice
The Apache License 2.0 is a permissive license that allows for:
- Commercial use
- Modification
- Distribution
- Patent use
- Private use
The license also disclaims liability and warranty, and it requires that license and copyright notices be preserved.
For the complete license text, please see the LICENSE file in this repository.
Project details
File details
Details for the file vivre-0.1.0.tar.gz.
File metadata
- Download URL: vivre-0.1.0.tar.gz
- Upload date:
- Size: 37.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.13.3 Darwin/24.4.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | de7feeff1aaca0a8bcf074952bf62944078ec1ee583f24f2cd23d48bf013e79a |
| MD5 | 94820c1a3e78749a957fb5268a1bf6e6 |
| BLAKE2b-256 | d4833b52e79963f6d99b3e3d447a189fced88cb4f97d13a600c059fcf15e5d7d |
File details
Details for the file vivre-0.1.0-py3-none-any.whl.
File metadata
- Download URL: vivre-0.1.0-py3-none-any.whl
- Upload date:
- Size: 38.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.13.3 Darwin/24.4.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1f63a69119d2fe43faee0b1a013d1b355d2b82cb578f89f40097b9668913b2fd |
| MD5 | 82350c4ecdf77dde931c1ddcb839d0ad |
| BLAKE2b-256 | 41cfd03bf64b6f77c689b031d050df9d50e6496b0d2ae10aaaa0730991be2fd2 |