A tool to process and export datasets in various formats including ORC, Parquet, XML, JSON, HTML, CSV, HDF5, and XLSX.
Project description
PandasDatasetProcessor
PandasDatasetProcessor is a Python package that provides utility functions for loading, saving, and processing datasets with pandas DataFrames. It supports reading and writing multiple file formats, as well as partitioning datasets into smaller chunks.
Features
- Load datasets from multiple file formats (CSV, JSON, XML, Parquet, ORC, HDF5, etc.).
- Save datasets in various formats including CSV, JSON, Parquet, ORC, XML, HTML, HDF5, and XLSX.
- Partition a DataFrame into smaller datasets for efficient processing.
- Custom error handling for incompatible actions, formats, and processing.
Installation
To install the package, use pip:

pip install pandas-dataset-processor
Usage Example
1. Importing the package
import pandas as pd
from pandas_dataset_processor import PandasDatasetProcessor
2. Loading a dataset
You can load a dataset using the load_dataset method; it automatically detects the file format from the file extension.
dataset = PandasDatasetProcessor.load_dataset('path/to/your/file.csv')
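Extension-based detection like this is typically a small dispatch table mapping suffixes to pandas readers. The sketch below illustrates the idea only; it is not the package's actual implementation, and `load_by_extension` is a hypothetical name.

```python
from pathlib import Path

import pandas as pd

# Hypothetical dispatch table; the package's real implementation may differ.
READERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
}

def load_by_extension(file_path):
    """Pick a pandas reader based on the file extension and load the file."""
    suffix = Path(file_path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file format: {suffix}")
    return reader(file_path)
```

An unknown extension raises an error instead of silently guessing, mirroring the package's own format-validation behavior.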
3. Saving a dataset
To save a DataFrame in a specific file format, use the save_dataset method. You can specify the directory, base filename, and format (e.g., CSV, JSON, Parquet).
PandasDatasetProcessor.save_dataset(
    dataset=dataset,              # DataFrame to save
    action_type='write',          # action type must be 'write' for saving
    file_format='csv',            # file format such as 'csv', 'json', 'parquet', etc.
    path='./output',              # directory where the file will be saved
    base_filename='output_file'   # base filename for the saved file
)
4. Partitioning a dataset
You can partition a dataset into smaller DataFrames for distributed processing or other use cases:
partitions = PandasDatasetProcessor.generate_partitioned_datasets(dataset, num_parts=5)
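Row-wise partitioning of this kind can be sketched with `numpy.array_split`, which yields the requested number of near-equal chunks. This is only an illustration of the technique; `generate_partitioned_datasets` may use a different strategy internally, and `split_dataframe` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def split_dataframe(df, num_parts):
    """Split a DataFrame into num_parts near-equal row chunks."""
    # np.array_split front-loads the larger chunks when rows don't divide evenly.
    return [part.reset_index(drop=True) for part in np.array_split(df, num_parts)]
```

Each chunk is a standalone DataFrame with a fresh index, so it can be saved or processed independently.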
Example Code
import pandas as pd
from pandas_dataset_processor import PandasDatasetProcessor
dataset_1 = pd.read_csv('https://raw.githubusercontent.com/JorgeCardona/data-collection-json-csv-sql/refs/heads/main/csv/flight_logs_part_1.csv')
dataset_2 = pd.read_csv('https://raw.githubusercontent.com/JorgeCardona/data-collection-json-csv-sql/refs/heads/main/csv/flight_logs_part_2.csv')
file_formats = ['orc', 'parquet', 'xml', 'json', 'html', 'csv', 'hdf5', 'xlsx']
datasets = [dataset_1, dataset_2]
# Example usage
file_locations = []
# Save datasets in multiple formats
for index_dataset, dataset in enumerate(datasets):
    for index_file, file_format in enumerate(file_formats):
        path = f'./data/dataset_{index_dataset + 1}'
        base_filename = f'sample_dataset_{index_file + 1}'
        file_location = f"{path}/{base_filename}.{file_format}"
        file_locations.append(file_location)

        PandasDatasetProcessor.save_dataset(
            dataset=dataset,
            action_type='write',
            file_format=file_format,
            path=path,
            base_filename=base_filename
        )
# Load the saved files
for file_location in file_locations:
    PandasDatasetProcessor.load_dataset(file_location)
# Generate partitioned datasets
partitions = PandasDatasetProcessor.generate_partitioned_datasets(dataset_2, 7)
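Assuming the partitions are returned as a list of DataFrames, as the examples above suggest, the chunks can be recombined later with standard pandas, independent of the package:

```python
import pandas as pd

# Stand-in for partitions produced by the partitioning step above.
partitions = [
    pd.DataFrame({"flight_id": [1, 2]}),
    pd.DataFrame({"flight_id": [3]}),
]

# Recombine the chunks into a single DataFrame with a fresh index.
combined = pd.concat(partitions, ignore_index=True)
```

`ignore_index=True` discards the per-chunk indices so the result is numbered 0..n-1 again.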
Error Handling
The package raises custom exceptions for handling different error scenarios:
- IncompatibleActionError: Raised when the specified action is not supported (e.g., trying to read a dataset when a write action is expected).
- IncompatibleFormatError: Raised when the file format is not supported.
- IncompatibleProcessingError: Raised when neither the action nor the format is supported for processing.
- SaveDatasetError: Raised when an error occurs while saving a dataset in a specific format.
- LoadDatasetError: Raised when an error occurs while loading a file in a specific format.

Note: read_orc() is not compatible with Windows OS.
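A family of custom exceptions like this is usually built on a shared base class so callers can catch everything with one handler. The definitions below are illustrative only; the package ships its own exception classes, and the `DatasetProcessorError` base is an assumption, not the package's actual code.

```python
class DatasetProcessorError(Exception):
    """Hypothetical common base for the package's errors."""

class IncompatibleActionError(DatasetProcessorError):
    """The requested action (e.g. 'read' vs 'write') is not supported."""

class IncompatibleFormatError(DatasetProcessorError):
    """The requested file format is not supported."""

class SaveDatasetError(DatasetProcessorError):
    """Saving the dataset in the requested format failed."""
```

With a shared base, `except DatasetProcessorError` catches any package error, while the specific subclasses still allow fine-grained handling as shown in the example below.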
Exception Handling Example
try:
    PandasDatasetProcessor.save_dataset(dataset, 'write', 'xml', './output', 'example')
except SaveDatasetError as e:
    print(f"Error saving the dataset: {e}")
except IncompatibleFormatError as e:
    print(f"Unsupported format: {e}")
except IncompatibleActionError as e:
    print(f"Unsupported action: {e}")
except IncompatibleProcessingError as e:
    print(f"Processing not supported: {e}")
License
This package is licensed under the MIT License. See the LICENSE file for more details.
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file pandas-dataset-handler-0.0.1.tar.gz.
File metadata
- Download URL: pandas-dataset-handler-0.0.1.tar.gz
- Upload date:
- Size: 5.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.20
File hashes
Algorithm | Hash digest
---|---
SHA256 | 29d8af8114e3a1591bc8b0ffb4db5a7f8d9130848de3d20f8a714cd91413aeac
MD5 | d6fa22b44a2db2ae56b99e00efcf50df
BLAKE2b-256 | 6f7b4fc2d1c60d035f4ba00d8071333adfded1adf6ff87ee320ebf90dab9860b
File details
Details for the file pandas_dataset_handler-0.0.1-py3-none-any.whl.
File metadata
- Download URL: pandas_dataset_handler-0.0.1-py3-none-any.whl
- Upload date:
- Size: 6.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.20
File hashes
Algorithm | Hash digest
---|---
SHA256 | f39547707c7309a3cf5803b8052e6662d661412c65d3a2d99e74648ab7cc2f6e
MD5 | 4a953dc67a06a9ac217345c9681c24fd
BLAKE2b-256 | 87942f1e8ee7b067ae9ec948411081f04655e3adab70139df678980d65057fe9