A tool to convert text files to Parquet format
Parquet Converter
A Python utility to convert TXT and CSV files to Parquet format. Developed by Sami Adnan.
Overview
Parquet Converter is a command-line tool that allows you to convert text-based data files (TXT and CSV) to the Parquet format. It provides options for batch processing, detailed conversion statistics, and flexible configuration.
This project is part of Sami Adnan's DPhil research at the Nuffield Department of Primary Care Health Sciences, University of Oxford.
Features
- Convert individual files or entire directories of TXT and CSV files to Parquet
- Automatic file type detection based on extension
- Configurable delimiters for CSV and TXT files
- Generate detailed conversion statistics and reports in JSON format
- Flexible output path configuration with automatic directory creation
- Support for secure configuration via:
  - Environment variables
  - JSON/YAML config files
  - Command-line arguments
- Automatic data type inference for:
  - Integers (int32, int64)
  - Floats (float32, float64)
  - Dates (with custom format support)
  - Booleans
  - Strings
- Configurable parsing options:
  - File encoding
  - Header row position
  - NA value handling
  - Memory usage optimization
- Environment variable support for key configuration options
- Comprehensive logging system:
  - Console output
  - File logging
  - Different log levels
  - Formatted statistics tables
- Conversion statistics and reports:
  - JSON format reports
  - Success/failure tracking
  - Row and column statistics
  - Error and warning logging
- Progress tracking for batch conversions with visual progress bars
- Pre-commit hooks for code quality:
  - Black for code formatting
  - isort for import sorting
  - Flake8 for linting
  - mypy for type checking
- Performance optimization options:
  - Configurable compression (snappy, gzip, etc.)
  - Memory usage control
  - Chunk-based processing for large files
Installation
From PyPI
pip install parquet-converter
From GitHub
# Clone the repository
git clone https://github.com/sami5001/parquet-converter
cd parquet-converter
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e .
Development Setup
For more detailed setup instructions, see README-setup.md.
# Install development dependencies
pip install -r requirements-dev.txt
# Install pre-commit hooks
pre-commit install
Usage
Command Line Interface
# Basic Usage
# Convert a single file
parquet-converter input.csv -o output_dir/
# Convert a directory of files
parquet-converter input_dir/ -o output_dir/
# Advanced Usage
# Use a configuration file
parquet-converter input.csv -c config.yaml
# Enable verbose logging
parquet-converter input.csv -v
# Save current configuration to a file
parquet-converter input.csv --save-config my_config.yaml
# Convert with custom output directory
parquet-converter input.csv -o /path/to/output/
# Convert multiple file types in a directory
parquet-converter data_dir/ -o output_dir/ -c config.yaml
# Convert with verbose logging and custom config
parquet-converter input.csv -v -c custom_config.yaml -o output_dir/
Example Configuration Files
- Basic YAML Configuration:
# config.yaml
csv:
  delimiter: ","
  encoding: "utf-8"
  header: 0
txt:
  delimiter: "\t"
  encoding: "utf-8"
  header: 0
datetime_formats:
  default: "%Y-%m-%d"
  custom: ["%d/%m/%Y", "%Y-%m-%d %H:%M:%S"]
infer_dtypes: true
compression: "snappy"
log_level: "INFO"
- Advanced Configuration with Performance Settings:
# advanced_config.yaml
csv:
  delimiter: ","
  encoding: "utf-8"
  header: 0
  low_memory: true
  chunk_size: 10000
txt:
  delimiter: "\t"
  encoding: "utf-8"
  header: 0
  low_memory: true
  chunk_size: 10000
datetime_formats:
  default: "%Y-%m-%d"
  custom: ["%d/%m/%Y", "%Y-%m-%d %H:%M:%S"]
infer_dtypes: true
compression:
  type: "snappy"
  level: 1
  block_size: "128MB"
log_level: "DEBUG"
log_file: "conversion.log"
parallel:
  enabled: true
  max_workers: 4
Environment Variables
You can configure the converter using environment variables:
# Set input and output paths
export INPUT_PATH="data.csv"
export OUTPUT_DIR="output/"
# Configure logging
export LOG_LEVEL="DEBUG"
export LOG_FILE="conversion.log"
# Set file-specific options
export DELIMITER=","
export ENCODING="utf-8"
Parquet Format Benefits
The Parquet format offers several advantages:
- Columnar storage format enabling efficient querying and compression
- Reduced I/O operations when querying specific columns
- Efficient data encoding and compression schemes
- Compatible with various big data tools (Spark, Hadoop, etc.)
Environment Variables Reference
Available environment variables for configuration:
- INPUT_PATH: Path to input file/directory
- OUTPUT_DIR: Output directory path
- LOG_LEVEL: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- LOG_FILE: Path to log file
- DELIMITER: Custom delimiter for text files
Additional Resources
For more information on Parquet, please refer to the Apache Parquet documentation.
Example Workflows
- Basic File Conversion:
# Convert a single CSV file
parquet-converter data.csv -o output/
- Batch Processing:
# Convert all files in a directory
parquet-converter data_dir/ -o output/ -c config.yaml
- Debug Mode:
# Convert with detailed logging
parquet-converter data.csv -v -o output/
- Custom Configuration:
# Save current settings as config
parquet-converter data.csv --save-config my_config.yaml
# Use custom config
parquet-converter data.csv -c my_config.yaml -o output/
- Performance-Optimized Conversion:
# Convert large files with memory optimization
parquet-converter large_data.csv -c performance_config.yaml -o output/
Output
The converter generates:
- Parquet files with inferred data types
- Conversion statistics in JSON format
- Detailed logs with conversion summary
- Progress indicators for batch operations
Example conversion report:
{
"timestamp": "2024-03-14T12:00:00",
"summary": {
"total_files": 2,
"successful": 2,
"failed": 0
},
"files": [
{
"input_file": "data.csv",
"output_file": "data.parquet",
"rows_processed": 1000,
"rows_converted": 1000,
"success": true
}
]
}
Development
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=parquet_converter
# Run specific test file
pytest parquet_converter/tests/test_converter.py
Code Quality
The project uses several tools to maintain code quality:
- Black for code formatting
- isort for import sorting
- Flake8 for linting
- mypy for type checking
These are enforced using pre-commit hooks.
Performance Considerations
The converter is optimized for performance with the following features:
- Efficient memory management for large files
- Parallel processing for batch conversions
- Optimized data type inference
- Configurable compression options
Best Practices
- File Size Optimization
  - For files > 1GB, consider using the low_memory=True option
  - Use appropriate compression (snappy for speed, gzip for size)
  - Process large files in chunks when possible
- Memory Usage
  - Monitor memory usage during conversion
  - Use appropriate chunk sizes for large files
  - Consider system resources when processing multiple files
- Performance Tuning
  - Adjust compression settings based on your needs
  - Use appropriate data types to minimize memory usage
  - Consider using SSD storage for better I/O performance
- Batch Processing
  - Use directory conversion for multiple files
  - Monitor system resources during batch operations
  - Consider using parallel processing for large batches
To run performance tests:
pytest parquet_converter/tests/test_performance.py -v
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
Contributions to the Parquet Converter project are welcome! Please refer to CONTRIBUTING.md for detailed guidelines on how to contribute.
Briefly, please follow these steps to contribute:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Run the tests to ensure everything works (pytest)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Acknowledgements
- Sami Adnan, Nuffield Department of Primary Care Health Sciences, University of Oxford
- Apache Parquet, PyArrow, and pandas development teams
Troubleshooting
Common Issues
- Memory Errors
  If you encounter memory errors with large files, try:
  - Using low_memory=True in the configuration
  - Reducing the chunk size
  - Processing files in smaller batches
- Encoding Issues
  If you see encoding errors, try:
  - Specifying the correct encoding in the config (e.g., 'utf-8', 'latin1')
  - Using encoding='utf-8-sig' for files with a BOM
- Performance Issues
  If conversion is slow:
  - Check if compression settings are appropriate
  - Consider using SSD storage
  - Adjust the chunk size based on available memory
Error Messages
Common error messages and their solutions:
- ValueError: Could not infer delimiter
  Solution: Specify the delimiter in the config file or use the --delimiter option
- MemoryError: Unable to allocate array
  Solution: Use low_memory=True or reduce the chunk size
- UnicodeDecodeError: 'utf-8' codec can't decode byte
  Solution: Specify the correct encoding in the config file
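For the delimiter error above, a quick way to check what delimiter a file actually uses is the standard library's csv.Sniffer. This is a standalone diagnostic sketch (the `detect_delimiter` helper is hypothetical, not part of the converter):

```python
import csv

def detect_delimiter(path, sample_bytes=4096):
    """Guess a text file's delimiter with csv.Sniffer, falling back to a comma."""
    with open(path, "r", encoding="utf-8", newline="") as f:
        sample = f.read(sample_bytes)
    try:
        # Restrict candidates to the delimiters the converter handles.
        return csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter
    except csv.Error:
        return ","
```

The detected value can then be passed explicitly via the config file or the --delimiter option.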
Advanced Usage
Custom Data Type Inference
You can customize data type inference by modifying the config:
data_types:
  integers:
    - int32
    - int64
  floats:
    - float32
    - float64
  dates:
    - date
    - datetime
  booleans:
    - bool
  strings:
    - string
    - category
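The effect of narrowing integer and float columns can be approximated with pandas' numeric downcasting. This sketch shows the general idea only; the converter's actual inference logic may differ, and `downcast_numeric` is a hypothetical helper.

```python
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Shrink numeric columns to the smallest dtype that holds their values."""
    out = df.copy()
    for col in out.select_dtypes(include="integer").columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include="float").columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out
```

For example, a column of small integers stored as int64 is reduced to int8, which shrinks both in-memory size and the resulting Parquet file.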
Parallel Processing
For batch processing, you can enable parallel processing:
parallel:
  enabled: true
  max_workers: 4
  chunk_size: 10000
Custom Compression
Configure compression settings:
compression:
  type: snappy      # Options: snappy, gzip, brotli, zstd
  level: 1          # Compression level (1-9)
  block_size: 128MB
API Reference
Command Line Arguments
usage: parquet-converter [-h] [-o OUTPUT_DIR] [-c CONFIG] [-v] input_path

positional arguments:
  input_path            Path to input file or directory

optional arguments:
  -h, --help            Show this help message and exit
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Output directory path
  -c CONFIG, --config CONFIG
                        Path to configuration file
  -v, --verbose         Enable verbose logging
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| csv.delimiter | string | "," | CSV file delimiter |
| csv.encoding | string | "utf-8" | File encoding |
| csv.header | int | 0 | Header row index |
| txt.delimiter | string | "\t" | TXT file delimiter |
| datetime_formats | list | ["%Y-%m-%d"] | Date format patterns |
| infer_dtypes | bool | true | Enable type inference |
| compression | string | "snappy" | Compression type |
| log_level | string | "INFO" | Logging level |
Roadmap
Planned features and improvements:
- Short Term
  - Support for more input formats (JSON, Excel)
  - Enhanced data type inference
  - Improved error handling
- Medium Term
  - Distributed processing support
  - Web interface for file conversion
  - Real-time conversion monitoring
- Long Term
  - Cloud storage integration
  - Advanced data validation
  - Custom transformation rules
Support
For issues and feature requests, please use the GitHub issue tracker.
Getting Help
- Check the Troubleshooting section
- Search existing issues
- Create a new issue with detailed information
- Join our Discussions for community support
File details
Details for the file parquet_converter-0.1.0.tar.gz.
File metadata
- Download URL: parquet_converter-0.1.0.tar.gz
- Upload date:
- Size: 24.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a93a2244612890fa35d9e1b38c4710391a9bc215428b2efdfb580925f42596e8 |
| MD5 | 54cff5d87c64e9a5d4d0002814bddc70 |
| BLAKE2b-256 | 600f2d6cfdbd43fe9ef96dc8484c53d68dd6204721467d4c6d447007c7656329 |
File details
Details for the file parquet_converter-0.1.0-py3-none-any.whl.
File metadata
- Download URL: parquet_converter-0.1.0-py3-none-any.whl
- Upload date:
- Size: 24.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7ba8e76c69d3d7f429b5d53e7186e9a81b6a76e98c49c56182af03ad15c87eeb |
| MD5 | 5c03f802270cfb5fab265ffc2ba91ecc |
| BLAKE2b-256 | 0f8128da04e7132f1dd22b8d38a6f96ee221c0c8bb5426a0345b1b426452422b |