Skip to main content

CLI for fast, flexbile concatenation of tabular data using polars.

Project description

PyPi CI GitHub stars DOI

joinem provides a CLI for fast, flexbile concatenation of tabular data using polars

Install

python3 -m pip install joinem

Features

  • Lazily streams I/O to expeditiously handle numerous large files.
  • Supports CSV and parquet input files.
    • Due to current polars limitations, JSON and feather files are not supported.
    • Input formats may be mixed.
  • Supports output to CSV, JSON, parquet, and feather file types.
  • Allows mismatched columns and/or empty data files with --how diagonal and --how diagonal_relaxed.
  • Provides a progress bar with --progress.
  • Add programatically-generated columns to output.

Example Usage

Pass input filenames via stdin, one filename per line.

find path/to/*.parquet path/to/*.csv | python3 -m joinem out.parquet

Output file type is inferred from the extension of the output file name. Supported output types are feather, JSON, parquet, and csv.

find -name '*.parquet' | python3 -m joinem out.json

Use --progress to show a progress bar.

ls -1 path/{*.csv,*.pqt} | python3 -m joinem out.csv --progress

If file columns may mismatch, use --how diagonal.

find path/to/ -name '*.csv' | python3 -m joinem out.csv --how diagonal

If some files may be empty, use --how diagonal_relaxed.

To run via Singularity/Apptainer,

ls -1 *.csv | singularity run docker://ghcr.io/mmore500/joinem out.feather

Add literal value column to output.

find path/to/ -name '*.csv' | python3 -m joinem out.csv --with-column 'pl.lit(2).alias("two")'

Alias an existing column in the output.

find path/to/ -name '*.csv' | python3 -m joinem out.csv --with-column 'pl.col("a").alias("a2")'

Apply regex on source datafile paths to create new column in output.

find path/to/ -name '*.csv' | python3 -m joinem out.csv --with-column 'pl.lit(filepath).str.replace(r".*/(.*)\.csv", r"${1}").alias("filename stem")'

API

usage: __main__.py [-h] [--version] [--progress]
                   [--how {vertical,horizontal,diagonal,diagonal_relaxed}]
                   output_file

Concatenate CSV and/or parquet tabular data files.

positional arguments:
  output_file           Output file name

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --progress            Show progress bar
  --how {vertical,horizontal,diagonal,diagonal_relaxed}
                        How to concatenate frames. See <https://docs.pola.rs/py-
                        polars/html/reference/api/polars.concat.html> for more information.

Provide input filenames via stdin. Example: find path/to/ -name '*.csv' | python3 -m joinem
-o out.csv

Citing

If joinem contributes to a scholarly work, please cite it as

Matthew Andres Moreno. (2024). mmore500/joinem. Zenodo. https://doi.org/10.5281/zenodo.10701182

@software{moreno2024joinem,
  author = {Matthew Andres Moreno},
  title = {mmore500/joinem},
  month = feb,
  year = 2024,
  publisher = {Zenodo},
  doi = {10.5281/zenodo.10701182},
  url = {https://doi.org/10.5281/zenodo.10701182}
}

And don't forget to leave a star on GitHub!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

joinem-0.2.0.tar.gz (5.5 kB view details)

Uploaded Source

Built Distribution

joinem-0.2.0-py2.py3-none-any.whl (5.7 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file joinem-0.2.0.tar.gz.

File metadata

  • Download URL: joinem-0.2.0.tar.gz
  • Upload date:
  • Size: 5.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.12.4

File hashes

Hashes for joinem-0.2.0.tar.gz
Algorithm Hash digest
SHA256 6d57b50ece32578ebb1488268d4d9b88d4e292338d0c99b78e8d5c02c31263c9
MD5 2df61cb7293e67aa7aa20088f2120af0
BLAKE2b-256 84dd1f8dc588e0dc19ae50983ca77bbcd821b80c17695efd68ead71848c4e9ef

See more details on using hashes here.

File details

Details for the file joinem-0.2.0-py2.py3-none-any.whl.

File metadata

  • Download URL: joinem-0.2.0-py2.py3-none-any.whl
  • Upload date:
  • Size: 5.7 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.12.4

File hashes

Hashes for joinem-0.2.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 567d6783622083f15612a5862f69451616886a4a9a1eebafc6a0357e979e47c8
MD5 add0291ff7c8c6147c1a0779a38d43a7
BLAKE2b-256 0364d70afc2ab0398e359d678d372a5d9e75d39d452406442d77d2a315c0150a

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page