CLI for fast, flexible concatenation of tabular data using polars.
Project description
joinem provides a CLI for fast, flexible concatenation of tabular data using polars.
- Free software: MIT license
- Repository: https://github.com/mmore500/joinem
- Documentation: https://github.com/mmore500/joinem/blob/master/README.md
Install
python3 -m pip install joinem
Features
- Lazily streams I/O to expeditiously handle numerous large files.
- Supports CSV and parquet input files.
- Due to current polars limitations, JSON and feather files are not supported as input.
- Input formats may be mixed.
- Supports output to CSV, JSON, parquet, and feather file types.
- Allows mismatched columns and/or empty data files with --how diagonal and --how diagonal_relaxed.
- Provides a progress bar with --progress.
- Adds programmatically generated columns to output with --with-column.
Example Usage
Pass input filenames via stdin, one filename per line.
find path/to/*.parquet path/to/*.csv | python3 -m joinem out.parquet
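When the file list is assembled in a script rather than with find, the same pipe can be driven from Python. The following is a minimal sketch, not part of joinem: format_stdin and run_joinem are hypothetical helper names, and run_joinem assumes joinem is installed in the current interpreter.

```python
import subprocess
import sys


def format_stdin(paths):
    """Render filenames one per line, the form joinem reads from stdin."""
    return "".join(f"{path}\n" for path in paths)


def run_joinem(paths, output_file):
    """Pipe filenames to `python -m joinem <output_file>` (requires joinem)."""
    subprocess.run(
        [sys.executable, "-m", "joinem", output_file],
        input=format_stdin(paths),
        text=True,
        check=True,  # raise if joinem exits with a nonzero status
    )


# Example call (hypothetical paths):
# run_joinem(["path/to/a.parquet", "path/to/b.csv"], "out.parquet")
```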
Output file type is inferred from the extension of the output file name. Supported output types are feather, JSON, parquet, and csv.
find -name '*.parquet' | python3 -m joinem out.json
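Inferring the output type from the file extension can be pictured as a simple lookup. The sketch below is an illustration only and is not taken from the joinem source: the function name, the mapping, and the exact extensions accepted are assumptions.

```python
from pathlib import Path

# Assumed mapping from extension to output type; the extensions joinem
# actually recognizes may differ.
FORMAT_BY_SUFFIX = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    ".feather": "feather",
}


def infer_output_format(output_file):
    """Return the output type implied by the file name's extension."""
    suffix = Path(output_file).suffix.lower()
    if suffix not in FORMAT_BY_SUFFIX:
        raise ValueError(f"unsupported output extension: {suffix!r}")
    return FORMAT_BY_SUFFIX[suffix]
```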
Use --progress to show a progress bar.
ls -1 path/{*.csv,*.pqt} | python3 -m joinem out.csv --progress
If file columns may mismatch, use --how diagonal.
find path/to/ -name '*.csv' | python3 -m joinem out.csv --how diagonal
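Diagonal concatenation takes the union of all column names and fills values missing from a given file with nulls. The pure-Python sketch below illustrates that behavior on tables represented as dicts of column lists; it is a conceptual model only, not how polars or joinem implement it, and concat_diagonal is a hypothetical name.

```python
def concat_diagonal(tables):
    """Stack tables row-wise over the union of columns, filling gaps with None."""
    # Collect column names in first-seen order.
    columns = []
    for table in tables:
        for name in table:
            if name not in columns:
                columns.append(name)

    result = {name: [] for name in columns}
    for table in tables:
        # Row count of this table; an empty table contributes zero rows.
        n_rows = len(next(iter(table.values()), []))
        for name in columns:
            result[name].extend(table.get(name, [None] * n_rows))
    return result


first = {"a": [1, 2], "b": [3, 4]}
second = {"a": [5], "c": [6]}
combined = concat_diagonal([first, second])
# combined == {"a": [1, 2, 5], "b": [3, 4, None], "c": [None, None, 6]}
```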
If some files may be empty, use --how diagonal_relaxed.
To run via Singularity/Apptainer,
ls -1 *.csv | singularity run docker://ghcr.io/mmore500/joinem out.feather
Add literal value column to output.
find path/to/ -name '*.csv' | python3 -m joinem out.csv --with-column 'pl.lit(2).alias("two")'
Alias an existing column in the output.
find path/to/ -name '*.csv' | python3 -m joinem out.csv --with-column 'pl.col("a").alias("a2")'
Apply regex on source datafile paths to create new column in output.
find path/to/ -name '*.csv' | python3 -m joinem out.csv --with-column 'pl.lit(filepath).str.replace(r".*/(.*)\.csv", r"${1}").alias("filename stem")'
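To preview what the regex above extracts before running it over many files, the same pattern can be tried with Python's standard re module. This is a rough stand-in rather than the polars call itself: Python writes the first capture group as \1 where the polars expression uses ${1}, and filename_stem is a hypothetical helper name.

```python
import re


def filename_stem(filepath):
    """Strip the leading directories and the .csv extension from a path."""
    # count=1 replaces only the first match.
    return re.sub(r".*/(.*)\.csv", r"\1", filepath, count=1)


stem = filename_stem("path/to/data.csv")  # "data"
```

Paths that do not match the pattern, such as a bare file name with no directory separator, pass through unchanged.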
API
usage: __main__.py [-h] [--version] [--progress]
                   [--how {vertical,horizontal,diagonal,diagonal_relaxed}]
                   output_file

Concatenate CSV and/or parquet tabular data files.

positional arguments:
  output_file           Output file name

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --progress            Show progress bar
  --how {vertical,horizontal,diagonal,diagonal_relaxed}
                        How to concatenate frames. See
                        <https://docs.pola.rs/py-polars/html/reference/api/polars.concat.html>
                        for more information.

Provide input filenames via stdin. Example: find path/to/ -name '*.csv' | python3 -m joinem -o out.csv
Citing
If joinem contributes to a scholarly work, please cite it as
Matthew Andres Moreno. (2024). mmore500/joinem. Zenodo. https://doi.org/10.5281/zenodo.10701182
@software{moreno2024joinem,
  author    = {Matthew Andres Moreno},
  title     = {mmore500/joinem},
  month     = feb,
  year      = 2024,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.10701182},
  url       = {https://doi.org/10.5281/zenodo.10701182}
}
And don't forget to leave a star on GitHub!