Skip to main content

A powerful CLI tool for data format conversion and synthetic data generation

Project description

PyForge CLI

PyForge CLI
A powerful command-line tool for data format conversion and manipulation, built with Python and Click.

๐Ÿ“– Documentation

๐Ÿ“š Complete Documentation | ๐Ÿš€ Quick Start | ๐Ÿ“ฆ Installation Guide | ๐Ÿ”ง CLI Reference

Features

  • PDF to Text Conversion: Extract text from PDF documents with advanced options
  • Excel to Parquet Conversion: Convert Excel files (.xlsx) to Parquet format with multi-sheet support
  • XML to Parquet Conversion: Intelligent XML flattening with automatic structure detection and configurable strategies
  • Database File Conversion: Convert Microsoft Access (.mdb/.accdb), DBF, and SQL Server MDF files to Parquet
  • MDF Tools Installer: Automated setup of Docker and SQL Server Express for MDF file processing
  • CSV to Parquet Conversion: Smart delimiter detection and encoding handling
  • Rich CLI Interface: Beautiful terminal output with progress bars and tables
  • Intelligent Processing: Automatic encoding detection, structure analysis, and column matching
  • Extensible Architecture: Plugin-based system for adding new format converters
  • Metadata Extraction: Get detailed information about your files
  • Cross-platform: Works on Windows, macOS, and Linux

Installation

Stable Version (Recommended)

pip install pyforge-cli

Development Version

To test the latest features and bug fixes before they're officially released:

# Install from PyPI Test (latest development version)
pip install -i https://test.pypi.org/simple/ pyforge-cli

# Or with dependency fallback
pip install --index-url https://test.pypi.org/simple/ \
           --extra-index-url https://pypi.org/simple/ \
           pyforge-cli

Note: Development versions follow the pattern X.Y.Z.devN+gCOMMIT and are automatically deployed from the main branch.

From Source

git clone https://github.com/Py-Forge-Cli/PyForge-CLI.git
cd PyForge-CLI
pip install -e .

Development Installation

git clone https://github.com/Py-Forge-Cli/PyForge-CLI.git
cd PyForge-CLI
pip install -e ".[dev,test]"

System Dependencies

For MDB/Access file support on non-Windows systems:

# Ubuntu/Debian
sudo apt-get install mdbtools

# macOS
brew install mdbtools

MDF File Processing Setup

For SQL Server MDF file conversion, install the MDF Tools:

# Interactive setup wizard
pyforge install mdf-tools

# Check installation status
pyforge mdf-tools status

This installs Docker Desktop and SQL Server Express automatically. See MDF Tools Documentation for details.

Quick Start

Convert PDF to Text

# Convert entire PDF
pyforge convert document.pdf

# Convert to specific output file
pyforge convert document.pdf output.txt

# Convert specific page range
pyforge convert document.pdf --pages "1-5"

# Include page metadata
pyforge convert document.pdf --metadata

Convert Excel to Parquet

# Convert Excel file to Parquet
pyforge convert data.xlsx

# Convert with specific compression
pyforge convert data.xlsx --compression gzip

# Convert specific sheets only
pyforge convert data.xlsx --sheets "Sheet1,Sheet3"

Convert Database Files

# Convert Access database to Parquet
pyforge convert database.mdb

# Convert DBF file with encoding detection
pyforge convert data.dbf

# Convert with custom output directory
pyforge convert database.accdb output_folder/

Convert CSV Files

# Convert CSV with automatic delimiter and encoding detection
pyforge convert data.csv

# Convert TSV (tab-separated) file
pyforge convert data.tsv

# Convert with compression
pyforge convert sales_data.csv --compression gzip

# Convert delimited text file
pyforge convert export.txt

Convert XML Files

# Convert XML with automatic structure detection
pyforge convert data.xml

# Convert with aggressive flattening for analytics
pyforge convert catalog.xml --flatten-strategy aggressive

# Handle arrays by expanding to multiple rows
pyforge convert orders.xml --array-handling expand

# Strip namespaces for cleaner column names
pyforge convert api_response.xml --namespace-handling strip

# Preview schema before conversion
pyforge convert complex.xml --preview-schema

Get File Information

# Display file metadata as table
pyforge info document.pdf

# Get Excel file information
pyforge info spreadsheet.xlsx

# Output metadata as JSON
pyforge info database.mdb --format json

List Supported Formats

pyforge formats

Validate Files

pyforge validate document.pdf
pyforge validate data.xlsx

MDF File Processing

# Step 1: Install MDF processing tools (one-time setup)
pyforge install mdf-tools

# Step 2: Check installation status
pyforge mdf-tools status

# Step 3: Start SQL Server (if not running)
pyforge mdf-tools start

# Step 4: Convert MDF files (when MDF converter is available)
# pyforge convert database.mdf --format parquet

# Container management
pyforge mdf-tools stop      # Stop SQL Server
pyforge mdf-tools restart   # Restart SQL Server
pyforge mdf-tools logs      # View SQL Server logs
pyforge mdf-tools test      # Test connectivity
pyforge mdf-tools config    # Show configuration
pyforge mdf-tools uninstall # Remove everything

Usage Examples

Basic PDF Conversion

# Convert PDF to text (creates report.txt in same directory)
pyforge convert report.pdf

# Convert with custom output path
pyforge convert report.pdf /path/to/output.txt

# Convert with verbose output
pyforge convert report.pdf --verbose

# Force overwrite existing file
pyforge convert report.pdf output.txt --force

Advanced PDF Options

# Convert pages 1-10
pyforge convert document.pdf --pages "1-10"

# Convert from page 5 to end
pyforge convert document.pdf --pages "5-"

# Convert up to page 10
pyforge convert document.pdf --pages "-10"

# Include page markers in output
pyforge convert document.pdf --metadata

Excel Conversion Examples

# Convert Excel with all sheets
pyforge convert sales_data.xlsx

# Interactive mode - prompts for sheet selection
pyforge convert multi_sheet.xlsx --interactive

# Convert sheets with matching columns into single file
pyforge convert monthly_reports.xlsx --merge-sheets

# Generate summary report
pyforge convert data.xlsx --summary

Database Conversion Examples

# Convert Access database (all tables)
pyforge convert company.mdb

# Convert with progress tracking
pyforge convert large_database.accdb --verbose

# Convert DBF with specific encoding
pyforge convert legacy.dbf --encoding cp1252

# Batch convert all DBF files in directory
for file in *.dbf; do pyforge convert "$file"; done

CSV Conversion Examples

# Convert CSV with automatic detection
pyforge convert sales_data.csv

# Convert international CSV with auto-encoding detection
pyforge convert european_data.csv --verbose

# Convert semicolon-delimited CSV (European format)
pyforge convert data_semicolon.csv

# Convert tab-separated file with compression
pyforge convert data.tsv --compression gzip

# Batch convert multiple CSV files
for file in *.csv; do pyforge convert "$file" --compression snappy; done

XML Conversion Examples

# Convert XML with automatic structure detection
pyforge convert catalog.xml

# Convert with aggressive flattening for data analysis
pyforge convert api_response.xml --flatten-strategy aggressive

# Handle arrays as concatenated strings
pyforge convert orders.xml --array-handling concatenate

# Strip namespaces for cleaner output
pyforge convert soap_response.xml --namespace-handling strip

# Preview structure before conversion
pyforge convert complex_structure.xml --preview-schema

# Convert compressed XML files
pyforge convert data.xml.gz --verbose

# Batch convert XML files with specific strategy
for file in *.xml; do pyforge convert "$file" --flatten-strategy moderate; done

File Information

# Show file metadata
pyforge info document.pdf

# Excel file details (sheets, row counts)
pyforge info spreadsheet.xlsx

# Database file information (tables, record counts)
pyforge info database.mdb

# Export metadata as JSON
pyforge info document.pdf --format json > metadata.json

Supported Formats

Input Format Output Formats Status
PDF (.pdf) Text (.txt) โœ… Available
Excel (.xlsx) Parquet (.parquet) โœ… Available
XML (.xml, .xml.gz, .xml.bz2) Parquet (.parquet) โœ… Available
Access (.mdb/.accdb) Parquet (.parquet) โœ… Available
DBF (.dbf) Parquet (.parquet) โœ… Available
CSV (.csv, .tsv, .txt) Parquet (.parquet) โœ… Available

Development

Setting Up Development Environment

# Clone the repository
git clone https://github.com/Py-Forge-Cli/PyForge-CLI.git
cd PyForge-CLI

# Set up development environment
pip install -e ".[dev,test]"

# Run tests
pytest

# Format code
black src tests

# Run all checks
ruff check src tests

Development Commands

# Testing
pytest                                    # Run tests
pytest --cov=pyforge_cli                 # Run tests with coverage

# Code Quality
black src tests                          # Format code
ruff check src tests                     # Run linting
mypy src                                 # Type checking

# Building
python -m build                          # Build distribution packages
twine upload dist/*                      # Publish to PyPI

Project Structure

PyForge-CLI/
โ”œโ”€โ”€ src/pyforge_cli/            # Main package
โ”‚   โ”œโ”€โ”€ __init__.py
โ”‚   โ”œโ”€โ”€ main.py                 # CLI entry point
โ”‚   โ”œโ”€โ”€ converters/             # Format converters
โ”‚   โ”‚   โ”œโ”€โ”€ __init__.py
โ”‚   โ”‚   โ”œโ”€โ”€ base.py            # Base converter class
โ”‚   โ”‚   โ”œโ”€โ”€ pdf_converter.py   # PDF to text conversion
โ”‚   โ”‚   โ”œโ”€โ”€ excel_converter.py # Excel to Parquet conversion
โ”‚   โ”‚   โ”œโ”€โ”€ mdb_converter.py   # MDB/ACCDB to Parquet conversion
โ”‚   โ”‚   โ””โ”€โ”€ dbf_converter.py   # DBF to Parquet conversion
โ”‚   โ”œโ”€โ”€ plugins/               # Plugin system
โ”‚   โ”‚   โ””โ”€โ”€ loader.py          # Plugin loading
โ”‚   โ””โ”€โ”€ utils/                 # Utilities
โ”‚       โ”œโ”€โ”€ file_utils.py      # File operations
โ”‚       โ””โ”€โ”€ cli_utils.py       # CLI helpers
โ”œโ”€โ”€ docs/                      # Documentation source
โ”œโ”€โ”€ tests/                     # Test files
โ”œโ”€โ”€ pyproject.toml            # Project configuration
โ””โ”€โ”€ README.md                 # This file

Requirements

  • Python 3.8+
  • PyMuPDF (for PDF processing)
  • Click (for CLI interface)
  • Rich (for beautiful terminal output)
  • Pandas & PyArrow (for data processing and Parquet support)
  • pandas-access (for MDB file support)
  • dbfread (for DBF file support)
  • openpyxl (for Excel file support)
  • chardet (for encoding detection)

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run tests and linting (pytest && ruff check src tests)
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Roadmap

Version 0.2.0 - Database & Spreadsheet Support (Completed)

  • โœ… Excel to Parquet Conversion
    • Multi-sheet support with intelligent detection
    • Interactive sheet selection mode
    • Column matching for combined output
    • Progress tracking and summary reports
  • โœ… MDB/ACCDB to Parquet Conversion
    • Microsoft Access database support (.mdb, .accdb)
    • Automatic table discovery
    • Cross-platform compatibility (Windows/Linux/macOS)
    • Excel summary reports with sample data
  • โœ… DBF to Parquet Conversion
    • Automatic encoding detection
    • Support for various DBF formats
    • Robust error handling for corrupted files

Version 0.3.0 - Enhanced Features (Released)

  • โœ… XML to Parquet Conversion
    • Automatic XML structure detection and analysis
    • Intelligent flattening strategies (conservative, moderate, aggressive)
    • Array handling modes (expand, concatenate, json_string)
    • Namespace processing (preserve, strip, prefix)
    • Schema preview before conversion
    • Support for compressed XML files (.xml.gz, .xml.bz2)
  • โœ… CSV to Parquet conversion with auto-detection (string-based)
  • CSV schema inference and native type conversion
  • JSON processing and flattening
  • Data validation and cleaning options
  • Batch processing with pattern matching
  • Configuration file support
  • REST API wrapper for notebook integration
  • Data type preservation options (beyond string conversion)

Version 0.4.0 - MDF Tools Infrastructure (Completed)

  • โœ… MDF Tools Installer
    • Automated Docker Desktop installation (macOS/Windows/Linux)
    • SQL Server Express 2019 container setup
    • Interactive setup wizard with non-interactive mode
    • Cross-platform compatibility with system package managers
  • โœ… Container Management Commands
    • 9 lifecycle commands: status, start, stop, restart, logs, test, config, uninstall
    • Rich terminal UI with status tables and progress tracking
    • Configuration management with JSON persistence
    • Health monitoring and connectivity testing
  • โœ… Infrastructure Foundation
    • Complete Docker and SQL Server environment for MDF processing
    • Foundation ready for MDF file conversion implementation
    • Comprehensive documentation with live terminal examples
    • ASCII architecture diagrams and troubleshooting guides

Version 0.5.0 - MDF File Conversion (Planned)

  • MDF to Parquet Converter (Issue #13)
    • SQL Server MDF file attachment and processing
    • 6-stage conversion process (matching MDB converter)
    • String-only data conversion for Phase 1 consistency
    • Batch processing with progress tracking
    • Excel summary reports with conversion statistics

Version 0.5.1 - Test Datasets Collection (Planned)

  • Sample Datasets Installation (Issue #15)
    • CLI command: pyforge install sample-datasets
    • Curated test datasets for all formats (XML, MDF, DBF, MDB, CSV, PDF)
    • Multiple size categories: small (<1MB), medium (1-100MB), large (100MB-3GB)
    • Direct download from online sources with metadata
    • Organized directory structure for testing scenarios

Version 0.6.0 - Advanced Features (Future)

  • SQL query support for database files
  • Data transformation pipelines
  • Cloud storage integration (S3, Azure Blob)
  • Incremental/delta conversions
  • Custom plugin development SDK

Support

If you encounter any issues or have questions:

  1. Check the ๐Ÿ“š Complete Documentation
  2. Search existing issues
  3. Create a new issue
  4. Join the discussion

Changelog

1.0.8 (Current Release)

  • โœ… Complete Testing Infrastructure Overhaul: Fixed 13 major issues across infrastructure and notebooks
  • โœ… Sample Datasets Installation: Fixed with intelligent fallback versioning system
  • โœ… Missing Dependencies: Added PyMuPDF, chardet, requests to resolve import errors
  • โœ… Convert Command Fix: Resolved TypeError in ConverterRegistry API
  • โœ… Comprehensive Testing Framework: Created systematic testing with 402 lines of test code
  • โœ… Notebook Organization: Restructured with proper unit/integration/functional hierarchy
  • โœ… Cross-Environment Support: Both local and Databricks notebooks fully functional
  • โœ… Enhanced Error Handling: Smart file selection, directory creation, PDF skip logic
  • โœ… Developer Documentation: Complete guides and deployment documentation

0.4.0

  • โœ… MDF Tools Installer: Complete automated Docker Desktop and SQL Server Express 2019 setup
  • โœ… Cross-Platform Installation: System package managers (Homebrew, Winget, apt/yum)
  • โœ… Container Management: 9 lifecycle commands for SQL Server operations
  • โœ… Interactive Setup Wizard: User-guided installation with smart detection
  • โœ… Rich Terminal UI: Beautiful status tables, progress bars, and error handling
  • โœ… Critical Docker Fix: Proper system package manager installation (not pip)
  • โœ… Infrastructure Foundation: Complete environment for MDF file processing
  • โœ… Comprehensive Documentation: Live terminal examples and ASCII diagrams

0.3.0

  • โœ… XML to Parquet Converter: Complete implementation with intelligent flattening
  • โœ… Automatic Structure Detection: Analyzes XML hierarchy and array patterns
  • โœ… Flexible Flattening Strategies: Conservative, moderate, and aggressive options
  • โœ… Advanced Array Handling: Expand, concatenate, or JSON string modes
  • โœ… Namespace Support: Configurable namespace processing
  • โœ… Schema Preview: Optional structure preview before conversion
  • โœ… Comprehensive Documentation: User guide and quick reference
  • โœ… Compressed XML Support: Handles .xml.gz and .xml.bz2 files

0.2.1

  • โœ… Complete Documentation Site: Comprehensive GitHub Pages documentation
  • โœ… Fixed CI/CD: GitHub Actions workflow for automated PyPI publishing
  • โœ… Improved Distribution: API token authentication and automation
  • โœ… Better Navigation: Fixed broken links and improved project structure

0.2.0

  • โœ… Excel to Parquet conversion with multi-sheet support
  • โœ… MDB/ACCDB to Parquet conversion with cross-platform support
  • โœ… DBF to Parquet conversion with encoding detection
  • โœ… Interactive mode for Excel sheet selection
  • โœ… Automatic table discovery for database files
  • โœ… Progress tracking with rich terminal UI
  • โœ… Excel summary reports for batch conversions
  • โœ… Robust error handling and recovery

0.1.0 (Initial Release)

  • PDF to text conversion
  • CLI interface with Click
  • Rich terminal output
  • File metadata extraction
  • Page range support
  • Development tooling setup

๐Ÿš€ Development & Deployment

Automated Versioning

PyForge CLI uses automated versioning with setuptools-scm:

  • Development versions: 1.0.x.devN - Auto-deployed to PyPI Test on every commit to main
  • Release versions: 1.0.x - Deployed to PyPI when Git tags are created
  • Version increment: Development versions auto-increment on each commit (dev1 โ†’ dev2 โ†’ dev3...)

Installation from Test Repository

# Install latest development version
pip install -i https://test.pypi.org/simple/ pyforge-cli

# Install specific development version  
pip install -i https://test.pypi.org/simple/ pyforge-cli==1.0.8.dev5

CI/CD Pipeline

  • Trigger: Every push to main branch automatically builds and deploys to PyPI Test
  • Release: Create a Git tag to trigger deployment to PyPI Production
  • Testing: Use pyforge-cli from test.pypi.org for validation before release

Recent Updates (v1.0.8)

Complete Testing Infrastructure Overhaul - Fixed 13 major issues across PyForge CLI infrastructure and testing notebooks:

  • โœ… Infrastructure Fixes: Sample datasets installation, missing dependencies, convert command API
  • โœ… Testing Framework: Comprehensive testing suite with 402 lines of test code
  • โœ… Notebook Support: Full local and Databricks notebook functionality
  • โœ… Error Handling: Smart file selection, directory creation, PDF skip logic
  • โœ… Documentation: Complete developer guides and deployment processes

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyforge_cli-1.0.9.tar.gz (7.3 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pyforge_cli-1.0.9-py3-none-any.whl (6.8 MB view details)

Uploaded Python 3

File details

Details for the file pyforge_cli-1.0.9.tar.gz.

File metadata

  • Download URL: pyforge_cli-1.0.9.tar.gz
  • Upload date:
  • Size: 7.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pyforge_cli-1.0.9.tar.gz
Algorithm Hash digest
SHA256 b05224398454c7c3901ddddf604ae4f3d6e809c32aa31270dd18d94921408425
MD5 afc1f5a6459b68647df5fba192e5ae95
BLAKE2b-256 db708098593e06efc63eacc57823a32b42ce6f0816bbffd82442a23a92368f62

See more details on using hashes here.

File details

Details for the file pyforge_cli-1.0.9-py3-none-any.whl.

File metadata

  • Download URL: pyforge_cli-1.0.9-py3-none-any.whl
  • Upload date:
  • Size: 6.8 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pyforge_cli-1.0.9-py3-none-any.whl
Algorithm Hash digest
SHA256 057e69b7203c8ed609a6c17b16cc4df70f32fea25e81df206d18bb5ff7786f4d
MD5 7dbee674e9e92e4284dc633c4b5ee6ee
BLAKE2b-256 9ec3ca49b4b410c7a828d795c0e36289b5114671ff2efcd12e80243e8225e063

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page